Can a data sample ever be large enough to compensate for a lack of diversity within the sample?

I’m doing some research for an unrelated piece and came across this idea, but because I am not really proficient in stats, I don’t know how to refute it. But I feel I could understand if explained to me logically.

I’d also be curious to know if there are analytic concerns with extremely large sample sizes.

4 Answers

Anonymous 0 Comments

So, your sample group is biased and you want to know if further sampling of said group will yield less biased results?

No. That will not yield more accurate results, merely further cement and refine the existing biased results.

Let’s split the population into two groups for an example: people who own or have access to cars, and those who don’t.

Will asking only the car owners about the necessity of public transportation ever reflect the views of those who don’t own cars, even if we ask every single one of them? Probably not. Car owners are more likely to think it’s not necessary because they can drive, whereas those who can’t are likely to think of it as an absolute necessity. The two groups may happen to share the same views, but we can’t be confident that they do, so we must acknowledge that this data is biased. We won’t get an accurate representation without representative sampling.
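To make that concrete, here is a rough simulation sketch. Every number in it is made up for illustration (70% of people own cars, 30% of owners and 90% of non-owners call public transit necessary), and `biased_poll` is just a hypothetical helper name. It polls car owners only and shows the estimate settling on the car owners' view rather than the whole population's as the sample grows.

```python
import random

def biased_poll(n, p_support_owners=0.30, seed=0):
    """Poll n car owners only; return the share who call transit necessary."""
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    yes = sum(rng.random() < p_support_owners for _ in range(n))
    return yes / n

# Hypothetical population: 70% car owners (30% support transit),
# 30% non-owners (90% support transit).
TRUE_SUPPORT = 0.70 * 0.30 + 0.30 * 0.90  # 0.48 across everyone

for n in (100, 10_000, 1_000_000):
    print(n, round(biased_poll(n), 3), "whole population:", round(TRUE_SUPPORT, 2))
```

Under these assumed rates, the larger samples land ever closer to 0.30, the car owners' rate, and never approach 0.48. More data makes the answer more stable, not less biased.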

Anonymous 0 Comments

Yes and no.

If you’ve got a sample of 500 Democrats and 50 Republicans, that’s not going to reveal different information about electoral outcomes than a sample of 500,000 Democrats and 50,000 Republicans. Indeed, statistics depends on this – that’s *why* you can interview so few people and still get a clear picture of the overall population.

With that in mind, you’d actually need to work pretty hard to ensure that the 10:1 bias in the small sample was replicated in the larger sample. If you’re not explicitly selecting for such bias, the closer your sample size gets to the population size, the less bias you’ll have in the sample. If your sample size becomes equal to your population size, it should be obvious that there cannot possibly be any sort of bias – your sample exactly mirrors all the important demographics of your population because they’re the same set.

Consider the difference between “I polled everyone in my office”, “I polled everyone in the building” and “I polled everyone in the city”. You’re not intentionally introducing or preserving bias, so simply expanding your sample will normally reduce the bias (increase the diversity) of your sample.

Anonymous 0 Comments

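The ‘magic number’ is ~1250 people for polling purposes. If you do a good enough job trying to control for your variables, 1250 data points will start to show strong trends; for a simple random sample, that works out to a margin of error of a little under ±3 percentage points at 95% confidence. More data narrows the margin further, with diminishing returns, but this is a good par.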
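For anyone wondering where a figure like that comes from, here is a small sketch using the textbook margin-of-error formula for a proportion. It assumes a simple random sample, 95% confidence (z = 1.96) and the worst case p = 0.5; those are my assumptions, not something stated in the answer above, and `margin_of_error` is just an illustrative name.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1250, 5000, 100_000):
    print(n, f"+/- {100 * margin_of_error(n):.1f} points")
# 100 -> 9.8, 1250 -> 2.8, 5000 -> 1.4, 100000 -> 0.3
```

Going from 1,250 to 5,000 respondents only halves the margin, which is why pollsters rarely bother with much more. Note that this formula only measures random sampling error; it says nothing about bias in who gets sampled.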

Remember that number next time you hear about a clinical trial or any kind of study that requires sampling. Small trials are typically used to establish safety and to surface weak correlations worth following up in larger experiments.

For better and more in-depth information, find a Political Science course on polling and statistical data.

Anonymous 0 Comments

Your color example perhaps needs some adjusting. There is “Truth in the Universe,” (TITU) in which we asked every single person what their favorite color is/was (FYI we can speak to the dead, just check out The Dead Files on the Travel Channel).

If we then ask 1,000 random living Americans what their favorite color might be, we will have some level of confidence that their answers reflect the actual TITU. We have biased the sample by using Americans, and by using those who are alive (perhaps quite a few of the departed favor sepia), but how important is nationality with regard to color? If unimportant, we can increase the sample size to 100,000 random Americans, and we will then have more confidence that the answer more closely approximates the TITU reality.

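But if nationality IS a biasing factor, then increasing the sample size while still only including Americans leaves a significant source of error. The computed confidence interval keeps narrowing, but it narrows around the American answer rather than the TITU, so the estimate gets more precise without getting more accurate.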
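A quick sketch of that last case, with invented numbers: suppose 40% of Americans favor blue while the TITU figure is 35%. Both rates and the helper name `ci95` are hypothetical, and the interval is the usual normal-approximation 95% interval for a proportion. The code computes the interval you would expect from an Americans-only sample of various sizes and checks whether it still covers the TITU value.

```python
import math

def ci95(p_hat, n):
    """Normal-approximation 95% confidence interval for a proportion."""
    half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

P_AMERICANS = 0.40  # hypothetical share of Americans favoring blue
P_EVERYONE = 0.35   # hypothetical TITU share

for n in (100, 1_000, 100_000):
    lo, hi = ci95(P_AMERICANS, n)
    print(n, f"[{lo:.3f}, {hi:.3f}]", "covers TITU:", lo <= P_EVERYONE <= hi)
```

With these made-up rates, the interval at n = 100 is wide enough to include 0.35, but by n = 1,000 it has tightened to roughly 0.370 to 0.430 and no longer contains it. The small sample was vague enough to hide the bias; the large one is confidently wrong.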