ELI5: What is standard deviation?



I googled it, but still understood nothing 😛

What’s this thing??


You’re probably familiar with the concept of an average or *mean*. In some sense, the mean tells you “where your data is” – is it mostly big or small, is it close to 0 or far away, etc.

Standard deviation tells you *how far from the mean your data is* on average. It’s a measure of how “spread out” your data is – a high standard deviation means your data is spread out, a low standard deviation means it’s piled up close to the mean. It’s a tad more complicated than this, but that’s the way you want to think about it conceptually.

Mathematically, you get the standard deviation by figuring out how far each point is from the mean, squaring those distances, averaging them, and then taking the square root. For example, if you have five people whose heights are {50, 55, 60, 65, 70}, we can easily compute the mean (60). Then the distances from the mean are {10, 5, 0, 5, 10}. Squaring these gets us 100, 25, 0, 25, 100, and averaging those gets us 50. Then the square root of 50 – about 7 – is our standard deviation.

Compare that to a different collection of heights, say, {60, 70, 80, 90, 100}. This collection has a larger standard deviation of about 14. We can interpret this as saying that the second collection {60, 70, 80, 90, 100} is in some sense “more spread out” than the original collection {50, 55, 60, 65, 70}.
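If it helps to see the recipe as code, here is a minimal Python sketch of those steps applied to both height collections. The function name `population_std` is just made up for illustration, and it divides by the number of points, the same way the worked example does.

```python
import math

def population_std(values):
    """Standard deviation, dividing by the number of points (n)."""
    mean = sum(values) / len(values)
    # Squared distance of each point from the mean.
    squared_distances = [(v - mean) ** 2 for v in values]
    # Average the squared distances, then take the square root.
    return math.sqrt(sum(squared_distances) / len(values))

first = [50, 55, 60, 65, 70]
second = [60, 70, 80, 90, 100]

print(round(population_std(first), 2))   # 7.07
print(round(population_std(second), 2))  # 14.14
```

Python's built-in `statistics.pstdev` does the same calculation. Note that `statistics.stdev` divides by n - 1 instead (the "sample" version), so it gives slightly larger numbers.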

Once you have an average, the standard deviation is “how far away from average are people, typically?”

If you count how many heads people have, the average is 1 and the standard deviation is basically zero. Most people’s head number is zero units away from the average.

If you count how many fingers people have, the average is 9.98-ish (I’m making these numbers up) and the standard deviation might be 0.02. Most people are very close to the average number, but thanks to polydactyly and table saws there is some variation.

If you have a test where half the class got 0 and the other half got 100, your average is 50, but your standard deviation is also going to be about 50, because all your people are very far away from the average. (Everybody, in this example, is either 50 above or 50 below the average.)

In other words, a small standard deviation means “this average represents the group really well; most people are only a small distance from the average” and a big standard deviation means “this average is not a good representation of the group; many people are a big distance from the average”.

It’s a measurement of how spread out your data is.

For example, if your data set is {1, 2, 3}, the standard deviation is ~0.82 (dividing by the number of points). However, if your data set is {0, 2, 4}, the standard deviation is ~1.6. The more spread out your data is, the larger the standard deviation will be.

There are a lot of ways to measure how spread out your data is. Another way would be to measure the average distance from the mean, without any squaring. This is called the mean absolute deviation (MAD), and it would be ~0.67 and ~1.3 respectively for the data sets above.
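A rough sketch of the two measures side by side, in case you want to check the numbers yourself. The helper names `mean_abs_dev` and `population_std` are invented for this example, and both divide by the number of points, which is the convention behind the MAD figures quoted above.

```python
import math

def mean_abs_dev(values):
    """Average absolute distance from the mean (no squaring)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def population_std(values):
    """Square root of the average squared distance from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

for data in ([1, 2, 3], [0, 2, 4]):
    print(data, "MAD:", mean_abs_dev(data), "SD:", population_std(data))
```

Whichever measure you pick, the second data set comes out exactly twice as spread out as the first, because every point is twice as far from the mean.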

Standard deviation in particular is used because it’s mathematically well-behaved in ways that are convenient for us. It has simple relationships with other derived values of distributions, and it is closely related to the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), which is a very common distribution both in nature and in mathematical models.

You have a bunch of numbers, exam scores for example. Find the mean of these numbers (say it's 70). Are your scores all close to the mean or are they really far away (did most people score around 70 or did you have a lot of very low/high scores)? A 72 is two above average while a 68 is two below, but both are simply two points from the mean. The standard deviation tells us (roughly) the average difference from the mean for all of the scores. If you have a standard deviation of 1, then people’s scores are on average about 1 away from the mean exam score. A higher standard deviation implies that the average difference is bigger.

Imagine you measure the height of every adult male (or adult female, but maybe not both, so that you don’t get two peaks) in a single country (let’s go with my country of the UK). If you plot the number of people (y-axis) measuring a given height (x-axis), you’ll probably get some sort of bell curve. The mean height would be the position of the bell curve’s centre along the x-axis, right under the peak. The width of the bell, however, would be represented by the standard deviation.

The standard deviation is an important quantity for the following reason. Let’s say you find from your graph that the mean adult male height is around 5’9” (1.75m). Now let’s say you come across a guy who’s 6’0” (1.83m), three inches above the mean. Would you consider him particularly tall? Well, if the standard deviation of adult male height in the UK was six inches, then 6’0” would be half a standard deviation above the mean, which (for a bell curve) means that around 30% of the British adult male population would be his height or taller, i.e. nothing particularly special. What about if the standard deviation was three inches? Then he’d be one standard deviation from the mean (‘one sigma’), so around 16% of the adult male population would be taller than him, so he’d be on the tall side but still nothing to write home about. What if it was only one inch? Then he’d be three standard deviations above the mean (‘three sigma’), and only around 0.13% of the population would be taller than him. We’ve now got a BFG on our hands.
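Those percentages can be reproduced with Python's standard library, assuming heights follow a perfect bell curve (a normal distribution), which real heights only do approximately. The helper `fraction_taller` and the inch values are just my way of setting up the example above.

```python
from statistics import NormalDist

MEAN_IN = 69.0    # 5'9" in inches
HEIGHT_IN = 72.0  # 6'0" in inches

def fraction_taller(height, mean, sd):
    """Fraction of a normal population above `height`."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(height)

for sd in (6.0, 3.0, 1.0):
    sigmas = (HEIGHT_IN - MEAN_IN) / sd
    share = fraction_taller(HEIGHT_IN, MEAN_IN, sd)
    print(f"SD {sd} in -> {sigmas} sigma -> {share:.2%} taller")

# SD 6 in: about 31% taller
# SD 3 in: about 16% taller
# SD 1 in: about 0.13% taller (the 68-95-99.7 rule of thumb rounds this to 0.15%)
```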

So let’s say you’re a scientist looking for methane (a possible sign of life) in an exoplanet atmosphere. Suddenly you come across a spike that looks like an absorption feature of methane. But your data is noisy, so how can you be sure that it’s genuinely methane and not just noise? Well, what you can do is measure the typical size of the noise fluctuations in your spectra, which gives you the standard deviation (‘one sigma’) of your noise. Now, if your methane spike is 5 times larger than that noise level, then you have a ‘5 sigma detection’. Assuming the noise follows a bell curve, there’s only about a 1 in 3.5 million chance that noise alone would produce a spike that big, which reassures your reviewer that what you’ve discovered on your planet is indeed methane beyond any reasonable doubt.
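For a sense of how rare a 5 sigma fluke is, here is a quick sketch that assumes the noise is perfectly Gaussian (real instrument noise often isn't) and only counts flukes in one direction. The variable names are mine.

```python
from statistics import NormalDist

# Chance that pure Gaussian noise lands at least 5 standard deviations
# to one side of its mean. By symmetry this equals the lower tail below
# -5, which is numerically safer than computing 1 - cdf(5).
p_five_sigma = NormalDist().cdf(-5.0)

print(f"probability: {p_five_sigma:.2e}")
print(f"roughly 1 in {1 / p_five_sigma:,.0f}")
```

That lands in the one-in-a-few-million range. Counting flukes in both directions doubles the probability.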