This content introduces cluster sampling as a statistical technique, distinguishes it from stratified sampling and simple random sampling, and discusses potential sources of bias in data collection.
okay so this is the last part of our
first section of the school year end of
section 4.1 and we're going to learn
about another type of sampling technique
in this section so the new type of
technique is called a cluster sample and
let me go ahead before you guys get like
lost and reading my definition and
everything and walk you through a little
hypothetical example let's say I had a
big apartment complex with a lot of
units in it so it's like a lot of
separate buildings going on so I've got
all these little buildings in my
apartment complex right here and I want
to know what residents think about
raising the rents a little bit to build
a new pool so I wanna build a super cool
swimming pool right here and everybody's
gonna love it it's gonna be great but I
gotta raise rent a little bit to be able
to buy the pool so um I could go and
talk to every single resident and if I
didn't own that many buildings that
probably honestly is the best
thing to do but let's say it's just a
big complex and I don't have that kind
of time so I'm just gonna take a random
sample instead I could do an SRS just
number off every person who is paying
rent to me and then choose randomly that
way and that would be great
I could also do some sort of a
stratified sample a good variable to
stratify by might be income if
somebody has a lot of income maybe
they're willing to have a pool put in
where if they're not making as much that
may not be something that's worth the
expense for them so if I somehow had
access to their income levels I could
stratify by that get people at the high
income group medium medium low whatever
and pick some from each group and that
would be fine a lot of times
stratified samples though are more
difficult to execute
compared to other sampling techniques so
like let's say I did do that by income I
might have to go over to this apartment
talk to that floor right there then go
all the way over to here and then go to
there and go to there and get all these
different apartments the same thing
would actually also happen in SRS I'd be
like running all over my complex talking
to these different people so they
introduced another valid sampling
technique called a cluster sample
and what a cluster sample does is it
breaks the population up the population of all
my apartment residents here it breaks
them up into groups that are called
clusters based on proximity or something
that's already pre established so for
example in my apartment complex each
building is a nice little cluster
already um there's already a bunch of
people in here a bunch of people in here
a bunch of people in here and I can
treat these like my separate units in my
problem now cluster sampling says that
what you do is instead of numbering off
people you're not gonna number off every
single apartment in there you just do
the building itself that's building one
building two building three building four
etc and you randomly choose your numbers
so you're gonna choose I chose number
three number seven number eleven those
are my buildings so what I would do once
I get my numbers is I would go and I
would use everybody in that building for
my sample everybody in that building for
my sample in a stratified sample I take
a little bit out of each separate group
in a cluster sample the randomness
happens from picking the groups
themselves but once I have my groups I
use everybody inside of that group this
can be a good thing
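To make the "a little from every group" versus "everybody from a few groups" contrast concrete, here is a minimal Python sketch. The complex of 12 buildings with 20 residents each, and the function names, are made up for illustration. The same grouping is fed to both functions only to show the mechanical difference; in the apartment example the strata would be income groups rather than buildings.

```python
import random

def stratified_sample(groups, per_group, rng):
    """Take a few individuals from EVERY group."""
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, per_group))
    return sample

def cluster_sample(groups, n_clusters, rng):
    """Randomly pick whole groups, then take EVERYONE in them."""
    chosen = rng.sample(sorted(groups), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(groups[name])
    return chosen, sample

# Hypothetical complex: 12 buildings with 20 residents each.
buildings = {
    b: [f"building{b}-resident{r}" for r in range(1, 21)]
    for b in range(1, 13)
}

rng = random.Random(0)  # fixed seed so the run is repeatable
chosen, cluster = cluster_sample(buildings, 3, rng)
strat = stratified_sample(buildings, 5, rng)
print("buildings chosen:", chosen)
print("cluster sample size:", len(cluster))
print("stratified sample size:", len(strat))
```

Both samples end up with 60 residents here, but the cluster sample only requires visiting 3 buildings while the stratified one touches all 12.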
cluster sampling is good in the sense
that it's easier people are already in
buildings if I own this giant complex I
don't have to run around everywhere I
just go to the buildings that I chose
and use those people it doesn't work out
very well though if the buildings are
different in some fashion so cluster
sampling kind of assumes that this
building is the same as this building is
the same as this building one building
is as good as any other basically and
in order for this to work well your
clusters need to be like mini
populations so let's
talk about some ways that my little
apartment example wouldn't be so
great what if I had a bunch of fancier
apartments like luxury apartments in one
of the buildings like say building
number one was like really fancy
apartments and I chose that building for
my sample well if this building is
different than the other ones those
people will think differently and that's
not good I'm gonna get bias in my
results possibly or certainly
increased variability between different
samples so if you have buildings that
are different it won't work out so great
every building is supposed to be about
the same if I would have thought that Oh
each building is different I could pick
some from each and do a stratified
sample so cluster sampling works if the
buildings are more or less similar to
one another why do we do it I've already
talked about this it's usually more
efficient and easier to execute compared
to a stratified sample it's easier to
walk into one building and talk to
everybody than it is to run around and
do a little bit of each keeping these
two techniques straight is gonna be
really important for you guys um so I
have a little thing that I use little
phrase I want you to write down so
stratified versus cluster so a
stratified sample is going to be similar
within and then different between
my example of that was my freshmen my
sophomores my juniors my seniors with
you I don't know where are there s are
at senior each bubble as freshmen
probably feel about the same as
sophomores feel about the same juniors
feel about the same as each other
so all juniors think somewhat similarly
about the prom thing but juniors
compared to freshmen probably have
different opinions so the bubbles
themselves are very different but within
the bubble you're saying they're more or
less the same a cluster is going to be
the opposite of that so clusters are
different within you're going to get
people from all walks of life
within so in the apartment building
you'll have some people with higher
income some with lower presumably
but they're similar between both of
these are valid good sampling techniques
that if used correctly can be better
than an SRS but um they take thought to
execute properly if you do a bad job
thinking through either of these techniques
it's not gonna work out for you so I
have a big example to talk through the
different techniques here um and we're
looking at a library this library is
really big it has 20,000 books in it and
for some reason this library isn't
organized by section like by type of
book it's just A through Z so we got
like big ol shelves with the books
starting with authors A all the way
through Z first thing they ask us to do
is to just like describe in detail how
you would do an SRS and I have these
written out so I want to talk you
through this but you can ultimately
just write what I have right here you
can make a little paragraph like I did
or you can do like a list like that it's
up to you either way you can abbreviate
you can do whatever but there are
problems that ask you to talk in detail
about what you're going to do kids tend
to find these a little bit annoying
because you have to write a lot and then
you have to be fairly specific but in
general when they ask how to explain a
sample you need to make sure that you
are going to be super detailed so
explain how to select a simple random
sample detail is gonna be so key in this
class if there's one thing I write for
you on feedback again and again and again
over the course of the year it's going
to be detail you need to make sure
you're being very specific in how you
write it so talking through my library
I have 20,000 books the first thing I
should do is I should label each book
and give them a number so you can see
right here that I said assign each book
a unique number from 1 to 20,000
that word unique is key in this problem
it's kind of like a stupid thing but
like if I just say oh give everybody a
number from 1 to 20,000 what if they use
numbers repeatedly oh that's one that's
one that's one by adding in the word
unique you're making sure they're not
being like ridiculous in how they are
setting up their labels right here so
the first thing you do when you're asked
to write one of these problems is you
talk about labels you need to make sure
you specify they're unique
then you need to talk about how to get
your random numbers this is random
numbers right there so I'm gonna use a
calculator I'm gonna use the command
randInt 1 comma 20,000 and I'm gonna
select 500 unique numbers again I use
the word unique it's a great word for
this because it means you don't repeat
things but even so you need to be very
explicit and I just go ahead and say
reject repeats or throw out repeated
numbers so you also need when you do one
of these problems to talk about what
happens to repeats and I don't want
to repeat a book when I'm making my
sample and finally you're gonna
talk about what you would actually do
your action I am going to look at the
number of pages in each book so writing
these out to be specific enough you have
to do these four things it's kind of a
lot and it's a little bit annoying to
write it out I sympathize with you there
I don't super love writing them in
detail but the AP test will do this to
make sure you truly understand what it
takes to collect a random sample all right so
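The four steps just described (unique labels, random numbers, reject repeats, then the action) can be sketched in Python. This is only an illustration: Python's `random.randint` stands in for the calculator command, and the function name `srs_labels` is made up.

```python
import random

N_BOOKS = 20_000   # population size from the example
SAMPLE_SIZE = 500  # sample size from the example

def srs_labels(n_population, n_sample, rng):
    """Draw random integers in 1..n_population, rejecting repeats."""
    chosen = set()
    while len(chosen) < n_sample:
        label = rng.randint(1, n_population)  # inclusive on both ends
        chosen.add(label)  # a set silently drops a repeated label
    return sorted(chosen)

rng = random.Random(1)  # fixed seed so the run is repeatable
labels = srs_labels(N_BOOKS, SAMPLE_SIZE, rng)
# Action step (not simulated here): pull these books and record page counts.
print(labels[:10])
```

In practice `random.sample(range(1, N_BOOKS + 1), SAMPLE_SIZE)` does the same job in one call; the loop is written out only to make the "reject repeats" step visible.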
second thing is a stratified sample in a
stratified sample in the same context
you would have to think about a variable
that makes a
difference in how long a book is and one
thing that they kind of fed you in the
problem there was they mentioned the
word genre and put that in your head it
seems plausible to me like think about
like a comic book or a magazine that
type of genre is gonna have a different
amount or even like a little young adult
fiction novel compared to a big history
textbook or a big encyclopedia or
something like that so the type of book
presumably makes a difference in how
many pages it has so what I could do
then instead of one big list I would
take all my history books number them
off from one to whatever all my fiction
books one to whatever all my I don't know
whatever kind of books there are and I
would pick some randomly out of each
group now when you do a stratified
sample you don't always have to do the
same amount in each group let's say
when I'm doing this problem I have like
I don't know tons and tons of fiction books
but my little encyclopedia pile is
actually smaller you can pick them
in proportion to the actual
population like let's say for example
50% of the books in this library are
fiction and only 10% are encyclopedias I
could make it so 10% of my sample is
encyclopedias and 50% is fiction
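Here is a small Python sketch of picking in proportion to the population. The 50% fiction and 10% encyclopedia shares come from the example; lumping the remaining 40% into a single "other" genre, and reusing the sample size of 500 from the SRS part, are assumptions made just so the numbers add up.

```python
import random

# 50% fiction and 10% encyclopedias come from the example;
# the 40% "other" bucket is an assumption.
genre_share = {"fiction": 0.50, "encyclopedia": 0.10, "other": 0.40}
N_BOOKS = 20_000
SAMPLE_SIZE = 500

rng = random.Random(2)  # fixed seed so the run is repeatable
sample = {}
for genre, share in genre_share.items():
    stratum_size = round(N_BOOKS * share)        # books in this genre
    n_from_stratum = round(SAMPLE_SIZE * share)  # proportional allocation
    # Number the books within the genre 1..stratum_size and take an SRS.
    sample[genre] = rng.sample(range(1, stratum_size + 1), n_from_stratum)

for genre, picks in sample.items():
    print(genre, len(picks))
```

With these shares the sample holds 250 fiction books, 50 encyclopedias, and 200 others, so each genre makes up the same fraction of the sample as it does of the library.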
so you can pick with respect to the
population that's totally cool but in
this problem a good variable to stratify
by would probably be the genre other
things kids have thrown my way over the
years um where they are on the shelves
presumably you're not putting a giant
massive thing like this way up on the top
shelf to hit somebody in the head so
somebody said couldn't you do like the
bottom shelf as one stratum and the next
shelf and the next shelf that's a
possibility too there's a lot of things
you can do to stratify by genre was just
the first most obvious thing I can think
of so let's keep it going then and talk
about the same scenario with a cluster
sample the books are set up on shelves
of 50 and the way that I made my library
is maybe unrealistic because I said my
library is just straight-up alphabetical
not separated into genre like most
libraries are if it's straight-up
alphabetical each shelf is probably about
as good as the others that's not how most
libraries are like if I had all my
encyclopedias over here a cluster sample
would be a bad thing because this is not
the same as this but assuming that oh
yeah they are all about the same
if the shelves are similar then all I
would do is I would number off my
shelves 1 2 3 4 etc I would do a random
sample and pick my shelves and then once
I got my shelves I would
look at every book on that shelf so
that's how you could incorporate a
cluster sample so what is a drawback to
each method I'm talking specifically I
want to focus on the SRS is fine
SRS works well but the SRS and the
stratified in my library are making me run
all over
the place to go find all those books
the cluster is going to be the easiest
of the three to execute because um I can
just go to one shelf and sit there and
like look at all of those books so the
cluster is frequently going to be more
efficient still works well if the
clusters are legit but if the
bookshelves actually are different and
I'm making a bad assumption I can get
bad data that way so let's move on and
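The "bad assumption" risk can be seen in a quick simulation. Everything numeric below except the 20,000 books and the shelves of 50 is made up: the page counts, the 10% share of long reference books, and the choice of 10 shelves per cluster sample. The sketch compares a library where long books are mixed onto every shelf with one where they sit together on their own shelves.

```python
import random
import statistics

SHELF_SIZE = 50  # books per shelf, from the example

def shelves_of(pages):
    """Split a list of page counts into consecutive shelves of 50 books."""
    return [pages[i:i + SHELF_SIZE] for i in range(0, len(pages), SHELF_SIZE)]

def cluster_mean(shelves, n_shelves, rng):
    """Pick whole shelves at random and average every book on them."""
    picked = rng.sample(shelves, n_shelves)
    return statistics.mean(p for shelf in picked for p in shelf)

rng = random.Random(3)  # fixed seed so the run is repeatable
# Made-up page counts: 18,000 ordinary books and 2,000 long reference books.
ordinary = [rng.randint(150, 350) for _ in range(18_000)]
reference = [rng.randint(900, 1100) for _ in range(2_000)]

grouped = ordinary + reference  # long books all together on their own shelves
mixed = grouped[:]              # length unrelated to shelf position
rng.shuffle(mixed)

spread = {}
for name, arrangement in [("mixed", mixed), ("grouped", grouped)]:
    shelves = shelves_of(arrangement)
    means = [cluster_mean(shelves, 10, rng) for _ in range(200)]
    spread[name] = statistics.stdev(means)
    print(name, "spread of sample means:", round(spread[name], 1))
```

The grouped arrangement should show a much larger spread from one cluster sample to the next, which is the increased variability you get when the clusters are not similar to one another.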
talk about this slide so this is a new
context right here now we're looking at
kids at a university and we want to know
what percent of students were at class
every single day so what percent of
students didn't skip any of their
classes and the only difference between
these two pictures each dot is going to
be like a separate sample so it doesn't
say oh it does it says I'm talking to
100 people then I'm talking to 400
people um so what this means because
this is honestly this simulation stuff
right here is one of the hardest parts
of all of AP Stats we hit it again and
again and again in my first picture what
is going on is I have 500 dots in this
picture each of those dots is a separate
sample of 100 kids so I am going to
survey a hundred kids for my sample we
will get this dot right here at like
point 80 that will be a sample where I
took a hundred kids and 80 percent of
them went to class and then I would
throw those kids back in and pick
another sample of 100 kids and that time
oh gosh only 75% said yes and then the
next time I go over here I got 67% etc
etc etc each of those dots is a separate
group of a hundred kids and I do that
there are 500 dots in that picture
similarly in my second picture right
here still each of these dots is a
separate sample but they changed it so
now this is a sample of 400 kids so each
dot is a group of 400 kids instead of a
hundred kids like over here still 500
dots but it's a different shape to the
picture when you look at it it's
different spread anyway
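The two dot pictures can be rebuilt with a short simulation. This sketch assumes the true proportion is 70%, the value the problem gives, and treats each student as an independent yes/no draw, which is a simplification of sampling from a real student body. The names `simulate_dots` and `spread_by_n` are made up.

```python
import random
import statistics

TRUE_P = 0.70    # true proportion given in the problem
N_SAMPLES = 500  # number of dots in each picture

def simulate_dots(sample_size, rng):
    """Return one sample proportion per dot."""
    dots = []
    for _ in range(N_SAMPLES):
        yes = sum(rng.random() < TRUE_P for _ in range(sample_size))
        dots.append(yes / sample_size)
    return dots

rng = random.Random(4)  # fixed seed so the run is repeatable
spread_by_n = {}
for n in (100, 400):
    dots = simulate_dots(n, rng)
    spread_by_n[n] = statistics.stdev(dots)
    print(n, "lowest:", round(min(dots), 2), "highest:", round(max(dots), 2),
          "spread:", round(spread_by_n[n], 3))
```

Both runs center near 0.70, but the dots from samples of 400 should sit in a noticeably tighter band than the dots from samples of 100.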
so let's talk about what's going on here
when your sample size gets bigger the
whole thing this slide right here is
we've talked about the word variability
already um it's just talking about how
spread out things are the data tends to
be more consistent when your sample size
is bigger so let's say at the college
it's actually where 70% of kids went to
class every single day and it actually
says that in the problem 70% of kids at
this school actually go to class without
skipping every single day if I talk
to a hundred kids it could be where I
get a fluky sample where wow I got a lot
of really motivated kids here and I
happen to get like 83 of them who went
to school every single day it could also
be where I have a group of 100 and oh
less than usual I only got like 56% over
here so there's a lot more spread going
on in this picture than there is over
here if you get a sample of 400 you're
not gonna have as many of those fluky
things like if the answer is actually 70
getting an 85 isn't gonna happen if you
take a bigger sample because even if you
get like some kids that oh yeah I do go
to class you also get more that don't
and it kind of balances out so this note
right here is a really important one
when your sample is bigger your data
becomes more consistent if I take a
sample of 400 and you take a sample of
400 we're probably gonna get closer to
the same answer we will hit this again
and again and again as we go through
this class so no worries if that's
still a little shaky right now so let's
see what else we've got here remind
myself how much more I have all right
couple vocab words here inference is
basically all of second semester of AP
Stats we're building up little pieces of
it now we really hit this hard second
semester when we talk about confidence
intervals significance testing and so on
so inference means
using your sample to talk about the
whole population using results from a
little sample to apply to the whole
population that's what inference means
when you collect results you're not
always going to get the exact answer
that's called variability like you may be
off by a little bit just by chance when
I looked at my last slide right here um
the answer was 70% but I don't get
exactly 70 every time that we do it I
could be a little higher could be a
little lower the amount that I think I
could be off by the maximum amount is
called the margin of error so if you
look at the picture with samples of 400 it looks like most
of my data points are within five
percent of 70 whereas over here on the
first picture with samples of 100 you can see it's a lot
more spread out it's more like 10% each
way or even a little higher than that so
the margin of error is lower on the
picture with the bigger samples and all margin of error really
means is the maximum you think your
answer might be off by and when you
increase sample size it makes your
estimates more precise we've talked
about this already it decreases
variability between samples so let's go
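The "about 5% each way" and "about 10% each way" readings line up with a common rule of thumb: roughly two standard deviations of the sample proportion. That formula is not stated in this video, so treat the sketch below as an outside assumption used to check the pictures, with 70% as the true proportion.

```python
import math

TRUE_P = 0.70  # true proportion from the class-attendance problem

def approx_margin_of_error(p, n):
    """Roughly two standard deviations of the sample proportion."""
    return 2 * math.sqrt(p * (1 - p) / n)

for n in (100, 400):
    print(n, round(approx_margin_of_error(TRUE_P, n), 3))
# prints 100 0.092 and 400 0.046
```

Quadrupling the sample size cuts this rough margin of error in half, which matches the tighter picture for samples of 400.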
ahead and talk about a few other things
with regards to sampling that last part
right there with the pictures was a little bit
confusing this isn't so bad we want to
talk about a few other things that can
go wrong after you've selected or when
you've decided you're gonna select your
sample so okay I've convinced you that
your sample should be random
what happens now first one is called
undercoverage and undercoverage means
that some individuals were not actually
able to be chosen for your sample
okay so some people were not able to be
chosen classic example there's a famous
newspaper picture that you might have
seen from like the 1940s
um it says Dewey defeats Truman it was
for the presidential race Dewey defeats
Truman Harry Truman is a president that
you guys have probably heard of I don't
even know Dewey's first name John maybe
um Thomas I'm not sure whoever he is
they really thought he was gonna win it
was the night of the election and
the results weren't in yet but
newspapers back then had to print
earlier
than they otherwise would like before the
results were final but they felt so good
from their surveys that they actually
went ahead and said oh yeah Dewey's
totally gonna win and published this
headline and there's a picture of Truman like
laughing looking at it after he actually
won the election what had happened there
were a couple of reasons but basically
one of them was they did a lot of their
surveying by phone so they called people
and said hey who are you gonna vote
for well back in the 1940s having a
phone was a fancy thing so they ended
up talking to a lot more wealthy people
who were in favor of Dewey and the
people without phones who tended to
have less money voted for Truman and
pushed him over the top and he ended up
winning that's undercoverage because
some people people without phones
weren't able to be contacted that
happens all the time like if you do an
internet survey some people still don't
have access to the internet so you're
not gonna hear as much from people in
rural areas older people as a general
rule or people who can't afford to pay
for the regular internet access so
certain groups get left out if you're
not careful on where your data comes
from that's called undercoverage the
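A toy simulation shows why undercoverage biases a result even when the sample itself is random. Every number below is invented for illustration and is not historical polling data: 40% of voters own a phone, phone owners back candidate A 60% of the time, and everyone else backs A 40% of the time.

```python
import random

rng = random.Random(5)  # fixed seed so the run is repeatable

# Entirely made-up numbers for illustration, not historical data.
PHONE_SHARE = 0.40         # share of voters who own a phone
SUPPORT_WITH_PHONE = 0.60  # share of phone owners backing candidate A
SUPPORT_NO_PHONE = 0.40    # share of non-owners backing candidate A

population = []
for _ in range(100_000):
    has_phone = rng.random() < PHONE_SHARE
    p = SUPPORT_WITH_PHONE if has_phone else SUPPORT_NO_PHONE
    population.append((has_phone, rng.random() < p))

def support(voters):
    """Fraction of the given voters who back candidate A."""
    return sum(backs_a for _, backs_a in voters) / len(voters)

reachable = [v for v in population if v[0]]  # only people a phone poll can reach
phone_poll = rng.sample(reachable, 1000)     # random, but from an incomplete list
full_poll = rng.sample(population, 1000)     # random from everyone

print("true support:", round(support(population), 3))
print("phone-only poll:", round(support(phone_poll), 3))
print("poll of everyone:", round(support(full_poll), 3))
```

The phone-only poll should land well above the true support because the people left off the list lean the other way, and no amount of randomness within the phone list fixes that.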
next thing on here is called
non-response and non-response is when
you choose your sample randomly so you've
chosen your sample everything is cool
you contact people hey be in my sample
and they either say no or you can't get
in touch with them okay so this happens
a lot um we do a semester project
related to sampling and people will like
email kids okay you've been chosen to be
in my survey and the kids are like I don't
want to do that they ignore it that is
an issue because the kids who ignore
your survey may feel differently than
kids who respond so you still run the
issue of bias even though you did your
sample randomly this sounds a lot like
voluntary response which we talked about
earlier in this section the key
difference is that voluntary response
you just put out in general hey be in
my study if you want to you didn't do
anything random so voluntary response
there was no randomness you just let
people invite themselves non-response
occurs after you've selected your random
sample so you did
the right thing and you randomly sampled
but then people didn't respond so what
would you do in that scenario in real
life if that happens well you try
contacting them again and try to get
them to follow up and be in the survey
and if they wouldn't you could randomly
pick somebody else but you can put like
a disclaimer that this many people
didn't respond because those people
could actually feel differently these
two things right here and many others
are examples of response bias response
bias is a general catch-all term for
things that might happen or that you
might do that might influence the
results or responses that you get so
response bias is as opposed to sampling
bias so your two main types of bias are
response bias and sampling bias actually
these two things right here I misspoke
and there you go those aren't really
response bias I'll give you some
examples of things that are so sampling
bias is bias in how you collected your
data in the first place response bias
can happen even if your
results came from a random sample let me
give you a few quick examples the
wording on the question can lead people
to respond a certain way so if I
think everybody at MRH ought to take four
years of math instead of three I can be
like hey given that colleges want to see
four years of math and math is an
essential tool for being great in life
and etcetera etc do you think we should
do four years of math instead of three
it doesn't matter if I was actually
random and who I talk to if you can tell
that I want it you may say yes just to
agree with me even if you don't actually
believe that yourself so the way that
you word the question can matter we
actually do our first semester project
on response bias so this is something
you guys will get to practice on later
on in the semester one thing that kids
did that um a good project from a couple
of years ago they had a Google survey
about doesn't even really matter what it
was just a random survey and they had
the big textbox option and the little
text box option and they just wanted to
see if people would type more in the big
textbox and they actually did so
something as simple as the size of the
text box or the color that your question
is written in or even the order of the
questions can influence responses I did
a little bit of that with you guys um in
the survey I gave you I played around
with the order of the questions and did
some things that count as response
bias which we'll investigate later on
not in this video here um one more
really good example and then I'll move
on there was a survey where basically
they asked people like how happy
are you with life overall that was like
one question they had you rate it so how
happy are you with life and then
followed it up with how many dates have
you been on in the past month and then
they reversed the order of the questions
and asked about the dates first and got
lower responses for the happiness with
life in general so something as simple
as like what order you ask things in can
make a difference in responses so let's
close this out by talking about these
couple of examples right here again
practicing vocabulary if you're not ever
sure on a vocab word don't say the word
and guess just describe it and that will
be fine
this is not so much of a buzz word class
where you have to know oh that's
voluntary response or non-response or
whatever but let's go through these if
you choose your sample out of a
telephone book that is going to be
undercoverage there are people who do not
have a phone number in the book and you
won't be able to contact them as a
result so you finally get a hold of
your sample you choose your sample but
some people can't be reached or don't return calls
that's non-response when they choose not
to be a part of your study right there
and then if you're just asking people
walking by on the sidewalk that is a
voluntary response so those are three
things that are all bad sampling
techniques alright this last question
and then we're done with this video here
I'm not gonna write this one but I'm
just going to talk through this they
want to figure out they found that 84%
of people opposed banning disposable
diapers and then explain how
this like leads to bias anytime they
talk about bias you have to do two
things what's the problem and what
direction
does this problem probably send things
so this person says it is estimated that
disposable diapers are less than 2% less
than 2% of the trash in landfills in
contrast beverages are 21% of the trash
given that it's only 2% would it be fair
to ban disposable diapers so people are
gonna look at that like Oh 2% that's not
very much those drinks are way more so
no we shouldn't ban diapers that's
ridiculous
this question right here is very much
pushing people towards the conclusion
that with such a small percentage
banning it isn't going to make much of a
difference so the wording of this
question is very much pushing people to
say that no it's not that big of a deal
the wording suggests it's not a big deal
so people are gonna say oh don't ban
them and they're pushed in that
direction more than if they had asked
the question just innocently and
neutrally so the actual percentage of
people who are in opposition the 84% is
probably overestimating that it's
probably higher than the actual answer
of people who oppose so that would be a
good way of describing the bias in that question