For a statistical sample with a 95% confidence level, explain what occurs the other 5% of the time. Is that sample more likely to be close to (but not within) the margin of error than to be far away?


In theory, in the general case, it can go either way: **it depends entirely on the probability distribution**. However, for most common distributions (like the Gaussian), it’s actually the case that, as you say, “the 5% is more likely to be close to (but not within) the margin of error than to be far away”.

See it like this: to say “95% confidence level” is like saying: “95% of the sand inside this box is on the left half of it”.

“What about the other 5% of the sand?” you ask “I know it is on the right half of the box, but is it more likely to be near the center, or farther on the right?”.

The reality is, you have been told literally nothing about that 5% of the sand (except that it’s not on the left); it *could* all be piled on the far right side, and it could even be shaped like a sand statuette of a duck. There’s nothing against this. But, *if* you assume that the sand in the box is just in one big conical pile (which is a common way for sand to be distributed), *then*, by knowing that 95% of it is on the left half, you also know that most of the 5% on the right is closer to the center than to the right-hand end of the box.

---

Edit: **for example, compare**:

Case 1: 95% of the people in my class are male (so a random person from my class has a 95% chance of being male). Are the other 5%, the non-male ones, likely to be more masculine rather than completely feminine? Answer: no, because males/females tend to be **distributed** in two separate clusters, with statistically very little in between.

Case 2: suppose 95% of Swedish men are taller than 180cm. Does it mean that the other 5%, shorter than 180cm, are likely not to be much shorter than that? Answer: yes, because people’s height tends to be **distributed** as a Gaussian distribution (a bit like the pile of sand in the box).

Conclusion: **it depends on the distribution.**
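To make the two cases concrete, here is a rough simulation sketch. Everything in it is made up for illustration: the helper `median_overshoot`, the sample sizes, and the cluster positions (0 and 10) are assumptions, not real data. It takes the 5% of samples beyond the 95% cutoff and asks how far past the cutoff they typically land.

```python
import random
import statistics

def median_overshoot(samples, level=0.95):
    """Median distance past the cutoff, among the samples beyond the `level` cutoff."""
    ordered = sorted(samples)
    split = round(level * len(ordered))
    cutoff = ordered[split - 1]   # largest value still inside the 95%
    tail = ordered[split:]        # the "other 5%"
    return statistics.median([x - cutoff for x in tail])

rng = random.Random(0)  # fixed seed so the run is repeatable

# Case 2 style: one bell-shaped pile.
bell = [rng.gauss(0, 1) for _ in range(20_000)]

# Case 1 style: two separate clusters, 95% near 0 and 5% near 10 (made-up numbers).
clusters = ([rng.gauss(0, 1) for _ in range(19_000)]
            + [rng.gauss(10, 1) for _ in range(1_000)])

bell_gap = median_overshoot(bell)
cluster_gap = median_overshoot(clusters)
print(f"bell curve:   typical overshoot {bell_gap:.2f}")
print(f"two clusters: typical overshoot {cluster_gap:.2f}")
```

In the bell-shaped pile the leftover 5% hugs the cutoff; in the two-cluster pile it sits far away from it. Same "95%", different distribution, different answer.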

> Is that sample more likely to be close to (but not within) the margin of error than to be far away?

It depends on the distribution. But for what you’re thinking of, the answer is most likely yes. You’re probably thinking of a Gaussian curve (“bell curve”) and it [looks like this](https://miro.medium.com/max/24000/1*IdGgdrY_n_9_YfkaCh-dag.png).

In a bell curve, about 95% of the data is within 2 **standard deviations** of the **average**. For example, IQ scores have a **standard deviation of 15** and an **average of 100**.

2 standard deviations is then 30. So within 30 of 100 really means as low as 70 or as high as 130. About 95% of people will have an IQ between 70 and 130.
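If you want to check that arithmetic yourself, here is a small sketch using Python’s standard-library `statistics.NormalDist`. It assumes IQ follows an exact bell curve with average 100 and standard deviation 15, which is only an approximation of real scores; the helper `share_between` is just a name picked for this example.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)  # average 100, standard deviation 15

def share_between(low, high):
    """Fraction of the bell curve that lies between low and high."""
    return iq.cdf(high) - iq.cdf(low)

within_1sd = share_between(85, 115)   # 100 +/- 15
within_2sd = share_between(70, 130)   # 100 +/- 30

# The multiplier that captures exactly 95% is a bit under 2.
exact_95_multiplier = NormalDist().inv_cdf(0.975)

print(f"within 1 SD: {within_1sd:.4f}")
print(f"within 2 SD: {within_2sd:.4f}")
print(f"SDs needed for exactly 95%: {exact_95_multiplier:.3f}")
```

So "2 standard deviations" is a handy rounding: it actually covers a little more than 95%, and the exact 95% range is slightly narrower than 70 to 130.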

About 5% of people have an IQ outside of 70-130, but you’re correct to assume you’ll see most of this 5% hover just below 70 or just above 130 instead of at some outlandish IQ like 170. Let’s go back to the [picture of a bell curve](https://miro.medium.com/max/24000/1*IdGgdrY_n_9_YfkaCh-dag.png) I showed you…

* “Within 1 standard deviation” is marked by the pink color. In the IQ example, that would be the range of 85-115.

* “Within 2 standard deviations” is the pink color **and** the blue color. In the IQ example, that would be 70-130.

* “Outside of 2 standard deviations” is the other 5%. But you see how there’s more green than orange? Your guess is correct: even if you’re outside the 95%, you’re still more likely to be close to it (green) than far away from it (orange).
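The "more green than orange" point can be put into numbers. This sketch assumes a perfect bell curve, and it also assumes the green band in the picture is the region between 2 and 3 standard deviations while orange is everything beyond 3; the picture itself doesn’t spell that out here, so treat the split as an illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard bell curve: average 0, standard deviation 1

outside_2sd = 2 * (1 - z.cdf(2))               # both tails beyond 2 SD
between_2_and_3sd = 2 * (z.cdf(3) - z.cdf(2))  # assumed "green" band
beyond_3sd = 2 * (1 - z.cdf(3))                # assumed "orange" band

# Of everything outside 2 SD, how much is still within 3 SD?
near_share = between_2_and_3sd / outside_2sd

print(f"outside 2 SD:     {outside_2sd:.4f}")
print(f"between 2 and 3:  {between_2_and_3sd:.4f}")
print(f"beyond 3 SD:      {beyond_3sd:.4f}")
print(f"share of the leftovers that stay within 3 SD: {near_share:.2%}")
```

In IQ terms, that means the large majority of people outside 70-130 are still somewhere in 55-70 or 130-145, and only a sliver is further out than that.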

When trying to estimate the tendency of all things (typically humans) from a small selection of them (maybe a few hundred at most) organized into groups for experimentation, you have to make a few assumptions. The big one is:

* Your groups are as identical as you can possibly make them

Obviously that’s really hard to nail down because you can’t possibly get 2+ identical groups. Even if you selected a hundred sets of twins and split them between the groups, there would still be differences between them. If you’re conducting a medical experiment and by sheer rotten luck everyone in group 1 is a smoker and nobody in group 2 smokes... well, your results are going to be inaccurate. If you don’t catch that mistake, your results are not going to be trustworthy.

Another big assumption that might be wrong:

* Your small group(s) statistically represent the entire population

Obviously this can also be very wrong without you realizing it, in many of the same ways. Again, a group of 100% smokers (by rotten luck) doesn’t represent all humans.

So the 95% confidence says how sure you are that your results correctly measured whatever it is you’re measuring, vs. the 5% chance that, by bad luck, you got shitty group assignments and the results are off as a result.
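You can watch the "bad luck" 5% happen in a simulation. This is only a sketch under stated assumptions: the population is a made-up bell curve (average 100, standard deviation 15), every sample is drawn purely at random, and the margin of error uses the usual 1.96 multiplier. All the constants and names are picked for illustration.

```python
import math
import random
import statistics

TRUE_MEAN, TRUE_SD = 100, 15     # made-up population
SAMPLE_SIZE, TRIALS = 100, 2000  # made-up experiment sizes
Z_95 = 1.96                      # usual multiplier for a 95% margin of error

rng = random.Random(1)  # fixed seed so the run is repeatable
hits = 0
miss_ratios = []  # for misses: how far off the estimate was, in units of the margin

for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    estimate = statistics.fmean(sample)
    margin = Z_95 * statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    error = abs(estimate - TRUE_MEAN)
    if error <= margin:
        hits += 1
    else:
        miss_ratios.append(error / margin)

coverage = hits / TRIALS
# Among the misses, how many were off by less than 1.5x the margin?
near_misses = sum(1 for r in miss_ratios if r < 1.5) / len(miss_ratios)

print(f"true average inside the margin: {coverage:.1%} of experiments")
print(f"misses that were near misses:   {near_misses:.1%}")
```

With random sampling from a bell curve, roughly 95% of the repeated experiments land within their margin of error, and most of the ones that miss only miss by a little. Systematic problems like the all-smokers group are a separate issue that the margin of error does not account for.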