# The Central Limit Theorem


I’m in grad school and taking statistics because I have to. I’m a total right-brained person. Art is what I do for fun. I am having a terrible time trying to understand this theorem and finding the mean and standard deviation of the distribution of sample means. Everything I’ve found so far is not helpful. Please help😭

In: Mathematics

Imagine you have a big bag of different colored candies, and you want to know the most common color in the bag. But instead of looking at all the candies at once (because that would take a long time), you decide to take small handfuls of candies and see what color is most common in each handful.

Now, if you take just one handful, it might not show the true most common color because you could just get a handful with mostly one color by chance. But if you take lots and lots of handfuls and look at the most common color in each one, you’ll start to notice a pattern. The more handfuls you take, the more the pattern you see will look like the true most common color in the whole bag.

The Central Limit Theorem is a bit like that. It tells us that if we take lots of small samples (handfuls) from a big group (the candy bag) and compute the average of each sample (in the candy analogy, the “most common color” is standing in for the average), those averages will form a pattern. And this pattern will look like a smooth hill, with the top of the hill sitting at the true average of the whole big group.

So, even if we don’t see all the candies, by taking lots of small samples, we can still get a pretty good idea of what the most common color is in the whole bag!

I’m going to assume your class is teaching frequentist statistics as these tend to be more commonly used in other fields. If your class is teaching Bayesian statistics (uses words like “prior”, “posterior”, and “*credible* interval”), let me know and I’ll update.

Statistics is the science of taking things we observe and trying to make the best guess of what the “truth” is. You can think of it as the inverse of probability. Probability says “This coin is fair and has a 50/50 chance of landing on either side, so let’s make some predictions about what will happen when I flip it a bunch of times”. Statistics says “I have a coin, let’s flip it a bunch of times, record the results, and see if we can figure out whether it’s fair or not”.
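The two directions can be sketched in a few lines of Python (a toy example, not anything from a real study):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Probability direction: assume the coin is fair (p = 0.5) and
# predict what repeated flips will look like.
flips = [random.random() < 0.5 for _ in range(10_000)]

# Statistics direction: given only the recorded flips, estimate
# the coin's bias and ask whether "fair" is a plausible truth.
estimated_p = sum(flips) / len(flips)
print(f"estimated probability of heads: {estimated_p:.3f}")
```

The estimate will land close to 0.5, but not exactly on it – quantifying how far off it's likely to be is exactly the job statistics does.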

Let’s say you want to figure out the “average” age of all Americans. In theory, you could survey every single American, figure out their age, and find the “true” value. Such surveys are called a “census” when they include (close to) 100% of the population.

In practice, this is rarely feasible. Instead, you can only survey, say, 1,000 people, but you are hoping to make an educated guess about *all* Americans based on those 1,000. This is where statistics comes in.

Let’s assume you are able to use a completely unbiased sampling method (much more difficult than it sounds). If you want to know why I specify this, imagine the different results you might get if you surveyed 1,000 random college students versus 1,000 random residents of nursing homes. Both of those would be biased methods.

Those 1,000 ages you recorded are your “sample”. The “sample mean” is the average of that sample. In this case, your sample mean – the average age of the 1,000 people – is your best guess for the average age of *all* Americans.

Statistics comes in when someone asks “How sure are you that you didn’t just happen to survey 1,000 super young people?” One way to address this question would be to do *another* survey of 1,000 random people and compare the two sample means. If they are close, you can use that as evidence that you’re probably close to the “true” value.

But how do you know your *second* sample wasn’t *also* a fluke? You might say “well the chances of both samples being a fluke in the exact same way are pretty low”. And you’re right! Statistics is the science of putting numbers to that claim.

The central limit theorem tells us that, if you do this process of taking a sample, taking the mean of that sample, and then repeating, those “sample means” will start to follow a very consistent pattern: a bell curve, aka a normal/Gaussian distribution. The center or “peak” of the bell curve will be the “true” average age of Americans. The width of that bell curve (roughly speaking, the variance or standard deviation) depends on both the underlying distribution (i.e., how much variation there is across ages of Americans) *and* how big your sample size is. Specifically, the standard deviation of the sample means – often called the “standard error” – is the population’s standard deviation divided by the square root of the sample size, so bigger samples give a narrower bell curve. In this case, our sample size was 1,000.
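You can watch this happen in a short simulation. The “population” of ages below is made up (uniform between 0 and 90, which is deliberately nothing like a bell curve); the code repeats the 1,000-person survey many times and checks the two predictions: the sample means center on the true mean, and their spread is about the population standard deviation divided by √1000.

```python
import random
import statistics

random.seed(0)

# Hypothetical population of ages, made up for illustration:
# uniform between 0 and 90, so flat rather than bell-shaped.
population = [random.uniform(0, 90) for _ in range(100_000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# Repeat the survey many times: each "survey" draws n = 1000
# people at random and records that sample's mean.
n = 1000
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(2_000)
]

# CLT predictions: the sample means cluster around the true mean,
# with standard deviation (the "standard error") near pop_sd / sqrt(n).
print(f"population mean:            {pop_mean:.2f}")
print(f"mean of sample means:       {statistics.mean(sample_means):.2f}")
print(f"predicted standard error:   {pop_sd / n ** 0.5:.2f}")
print(f"observed std of the means:  {statistics.stdev(sample_means):.2f}")
```

The predicted and observed numbers in the last two lines should land within a few hundredths of each other, and a histogram of `sample_means` would look like a bell curve even though the raw ages are flat.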

Once you have your first sample, you can make an educated guess about what exactly that bell curve will look like using statistics, and that will give you some insight into how far “off” your estimate *probably* is. For example, you might say “I’m 95% confident that I am within 2 years of the correct age, and I’m 99% confident that I am within 5 years of the correct age. I’m only 80% confident that I am within 1 year of the correct age.” (Numbers are made up but qualitatively accurate). The CLT is what lets us make these statements.
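Here’s what that looks like in practice, using the standard error from a single sample to build a 95% interval (again with a made-up sample of ages; the 1.96 comes from the normal distribution that the CLT promises):

```python
import random
import statistics

random.seed(1)

# One hypothetical survey of 1,000 ages (numbers are made up).
sample = [random.uniform(0, 90) for _ in range(1000)]

n = len(sample)
mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / n ** 0.5  # estimated standard error

# Because the CLT makes the sampling distribution roughly normal,
# about 95% of intervals built this way will contain the true mean.
low, high = mean - 1.96 * std_err, mean + 1.96 * std_err
print(f"sample mean: {mean:.1f}")
print(f"95% confidence interval: ({low:.1f}, {high:.1f})")
```

Note the interval is only a few years wide even though individual ages vary by decades – that’s the √n shrinkage at work.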

Now, there is one very important caveat to the CLT, which is where the “limit” part comes from: the distribution of those sample means won’t be *exactly* a normal distribution. However, as the size of each sample gets bigger, that distribution will become closer and closer to a perfect normal distribution.

This part is a bit extra: The real power (and super cool part to my nerdy brain) of the CLT is that it applies *regardless of the underlying distribution*. Age is “easy” in that it tends to follow something similar to a bell curve anyway, but not everything does. Maybe we are instead surveying net worth in dollars, which will be much more skewed/uneven. Plotting the distribution of the net worths themselves probably won’t look anything like a bell curve. The CLT says that doesn’t matter – once your sample size gets big enough, the distribution of the *sample means* (not the sample data itself) will become a bell curve.
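To see that claim in action, here is a sketch using a lognormal population as a stand-in for something net-worth-like (parameters made up). Skewness is roughly 0 for a symmetric bell curve and large for a lopsided distribution; the raw data is wildly skewed, but the sample means are nearly symmetric.

```python
import random
import statistics

random.seed(7)

def skewness(xs):
    # Moment-based skewness: near 0 for a symmetric bell curve,
    # large and positive for a long right tail.
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

# A heavily skewed, net-worth-like population (made-up parameters).
population = [random.lognormvariate(10, 1.5) for _ in range(50_000)]

# Means of many samples of size 1,000.
sample_means = [
    statistics.mean(random.sample(population, 1000)) for _ in range(1_000)
]

print(f"skewness of raw data:     {skewness(population):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The first number is huge; the second is close to zero – the sample means have already flattened into something bell-shaped even though no individual data point looks remotely normal.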