ELI5: What is a null hypothesis, and how do type 1 and type 2 errors work?

The null hypothesis depends on your analysis, but it usually says something like “there is no effect” or “there is no difference”.

Why do we use it? Because it’s often easier to precisely define your null hypothesis than it is to precisely define the alternative. For instance, in clinical trials the null hypothesis is typically “this treatment has no effect”. How much effect does that mean, precisely? Well, that’s easy: no effect means exactly 0. The alternative hypothesis is: “this treatment has some effect”. How much? Well, we don’t know. After collecting data, we can maybe give an estimate, but it’s hard to define it precisely beforehand for statistical purposes.

So, for statistical reasoning it’s typically easier to try to reject the null hypothesis, rather than prove the alternative. Since you can state the null hypothesis very precisely, you can make very clear mathematical predictions as to what range of outcomes would be plausible if the null hypothesis were true. For the alternative hypothesis, you cannot do this because there is a large, sometimes infinite range of values consistent with this hypothesis (e.g. the treatment might have an effect of 1 unit, 10 units, 100 units, 1000 units, etc.).

For instance, suppose we have a coin and we want to know whether it’s biased. Our null hypothesis is that the coin is fair, i.e. the probability of coming up heads is exactly 0.5. We then toss the coin 100 times and count the number of heads. If the null is true, then the outcome will be around 50 heads, but it doesn’t have to be exactly 50. The process is still random so it might be 51, or 49, or 44, etc. The further you get away from 50, the less likely the outcome is for a fair coin, and we can calculate exactly how likely.

For instance, for an experiment of 100 tosses of a fair coin, there is roughly a 5% probability (about 4.4%, to be precise) that we will count 41 heads or fewer. Likewise, there is roughly a 5% probability that we will get 59 heads or more. About 91% of experiments will end with a count between 42 and 58. There is a formula to calculate these things that I won’t give here, but it also works the other way: if you know your number, you can calculate how likely that number was if you assume the coin is fair.
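
If you want to check these tail probabilities yourself, here is a minimal Python sketch using the exact binomial formula (the 100 tosses and the fair-coin null come from the example above; the function name is just for illustration):

```python
# Tail probabilities for 100 tosses of a fair coin,
# using the exact binomial distribution (Python 3.8+ for math.comb).
from math import comb

N = 100          # number of tosses
P_FAIR = 0.5     # probability of heads under the null hypothesis

def prob_heads(k, n=N, p=P_FAIR):
    """P(exactly k heads in n tosses of a coin with P(heads) = p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Lower tail: 41 heads or fewer; by symmetry this equals P(59 or more).
lower_tail = sum(prob_heads(k) for k in range(0, 42))
print(f"P(<= 41 heads)  = {lower_tail:.4f}")           # ~0.0443
print(f"P(>= 59 heads)  = {lower_tail:.4f}")           # same, by symmetry
print(f"P(42..58 heads) = {1 - 2 * lower_tail:.4f}")   # ~0.911
```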

For example, suppose we count 61 heads. The probability of observing an outcome of 61 heads or more is about 1.76%, if the coin is fair. That’s not very high. But in statistics we often ask a slightly different question, namely: how likely is this deviation from the most likely outcome? That is, for a fair coin, the most likely outcome is 50 heads (out of 100 tosses). The next most likely outcomes are 49 and 51 – those two are exactly tied for probability. Then come 48 and 52, and so forth. The point is, it’s symmetrical: the probability only depends on how far away we get from 50, not on which side of 50 we’re on.

61 is 11 away from 50. So if we observe this outcome, we ask: how likely are we to get a number of heads that is 11 or more away from 50? The answer to that is about 3.52%. This still isn’t very high. It’s also less than 5%, which is a cut-off that is often used in statistics to decide whether a result is “significant”. This is just an arbitrary convention that says: “this result is so unlikely under the null hypothesis that we feel sufficiently confident to reject it”. In the case of our coin toss experiment, we would reject the null hypothesis that the coin is fair (and thus implicitly accept the alternative hypothesis that the coin is biased).
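
Here is one way to check the 1.76% and 3.52% figures, as a rough Python sketch of the same exact binomial calculation:

```python
# p-value for observing 61 heads in 100 tosses of a fair coin,
# i.e. the probability of a count at least 11 away from 50 in either direction.
from math import comb

N = 100

def prob_heads(k, n=N, p=0.5):
    """P(exactly k heads in n fair tosses)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

upper = sum(prob_heads(k) for k in range(61, N + 1))  # P(>= 61 heads)
lower = sum(prob_heads(k) for k in range(0, 40))      # P(<= 39 heads)

print(f"P(>= 61 heads)      = {upper:.4f}")           # ~0.0176 (one-sided)
print(f"P(11+ away from 50) = {upper + lower:.4f}")   # ~0.0352 (two-sided)
print("reject at the 5% cut-off?", upper + lower < 0.05)  # True
```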

It’s important to realize that the threshold for “sufficient confidence” is arbitrary. And most importantly, if you follow this 5%-rule, you won’t always get it right. Whenever the null hypothesis is actually true, about 1 in 20 tests will reject it anyway. In terms of our coin experiment: for every 20 genuinely fair coins that we test this way, on average 1 will produce an extreme result purely by chance, and we will wrongly call it “unfair”. This type of mistake, of incorrectly rejecting the null hypothesis, is called a “type 1” error.
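
One way to see type 1 errors in action is to simulate them: generate lots of genuinely fair coins, apply the rejection rule from the example above, and count how often we cry “biased” anyway. A rough sketch (the number of simulated coins is arbitrary):

```python
# Simulate many genuinely fair coins, test each with the
# "11 or more away from 50" rule from the example, and count false alarms.
import random

random.seed(1)
N_COINS = 20_000
TOSSES = 100

false_alarms = 0
for _ in range(N_COINS):
    heads = sum(random.getrandbits(1) for _ in range(TOSSES))  # 100 fair tosses
    if abs(heads - 50) >= 11:          # the rejection rule from the example
        false_alarms += 1

# Every one of these rejections is a type 1 error, because every coin really
# was fair. With a discrete head count the rate lands a little below the
# nominal 5% (about 3.5% for this particular rule), but the point stands:
# some fair coins get flagged purely by chance.
print(f"fraction of fair coins wrongly flagged: {false_alarms / N_COINS:.3f}")
```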

Of course we can make a different kind of mistake too. Suppose we test a coin that in truth has a 51% probability of coming up heads. In 100 coin tosses, it will be very hard to tell that this coin isn’t precisely fair. In fact, you’ll only end up rejecting the null hypothesis in about 5.43% of these experiments. That’s not a lot, especially when you realize that for a fair coin (i.e. when the null hypothesis is actually true) we would reject the null hypothesis about 5% of the time anyway.
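
One common way to get a figure like that 5.43% is a power calculation based on the normal approximation to the binomial; the sketch below lands in the same neighbourhood (about 5.5% here). Everything in it is illustrative – the exact method behind the 5.43% isn’t important for the argument.

```python
# Approximate power of a two-sided 5% test of "the coin is fair" (p = 0.5)
# when the true probability of heads is 0.51, using the usual normal
# approximation to the binomial.
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(p_true, n, p_null=0.5, z_crit=1.96):
    """Approximate P(reject the null) when the true P(heads) is p_true."""
    se_null = sqrt(p_null * (1 - p_null) / n)   # std. error under the null
    se_true = sqrt(p_true * (1 - p_true) / n)   # std. error under the truth
    shift = (p_true - p_null) / se_true
    upper = 1 - norm_cdf(z_crit * se_null / se_true - shift)
    lower = norm_cdf(-z_crit * se_null / se_true - shift)
    return upper + lower

print(f"power at p = 0.51, n = 100: {approx_power(0.51, 100):.4f}")
# ~0.0546, close to the 5.43% quoted above
```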

So for our slightly biased coin, we only rarely conclude that the coin isn’t fair. In the remaining 94.57% of cases, we say “we cannot reject the null hypothesis that this coin is fair”, and that’s what we call a “type 2” error: failing to reject the null, when the null actually is false.

When you use a 5% cut-off for rejecting the null, you make type-1 errors (incorrectly rejecting the null hypothesis) about 5% of the time that the null is actually true. The percentage of type-2 errors (incorrectly not rejecting the null) is harder to calculate, because it depends on the size of the true effect. If we know the true probability of heads is 51%, we can calculate how often we will make a type-2 error (94.57% of the time). But if the true probability is higher, we will make fewer type-2 errors, because that coin is easier to distinguish from a fair coin. Since we usually don’t know the size of the true effect (otherwise we wouldn’t be doing the experiment), the type-2 error rate is typically impossible to calculate precisely, and hard to even estimate (as this requires assumptions or prior knowledge).

Type-2 error calculations are also known as “power analyses”. The power of an experiment is how often you correctly reject the null, i.e. how often you correctly detect an effect or deviation from the null. As such, power is just the opposite (or, more precisely, the *complement*) of the type-2 error rate: if your experiment has a type-2 error rate of 20%, then its power is 80%, i.e. you will correctly reject the null hypothesis 80% of the time (so if you repeat the experiment 1000 times, you will correctly conclude “this coin is biased”, or whatever your alternative is, in about 800 of them).

Of course reality is more complicated than “null” or “not null”. Consider again our 51% coin. Yes, this coin is biased, and yes we will often fail to realize this in a 100-toss experiment. But it’s not *that* biased, and if we observe, say, 49 heads, then at least we can conclude that if there is a bias, then it is likely to be small.

Now, if we wanted to be really sure, we could toss the coin a million times, and that would give us a power of nearly 100%, i.e. at the end of such an experiment we would almost always be able to say “this coin is biased”. And this is also how it is in science: the more data we collect, the more sensitive we become to very small differences.
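
To see how the number of tosses drives power, we can re-use the same normal-approximation sketch from above and crank up the sample size for our 51% coin (the specific sample sizes are just illustrative):

```python
# How power grows with the number of tosses for a coin with P(heads) = 0.51,
# using the same normal-approximation sketch as before.
from math import sqrt, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(p_true, n, p_null=0.5, z_crit=1.96):
    se_null = sqrt(p_null * (1 - p_null) / n)
    se_true = sqrt(p_true * (1 - p_true) / n)
    shift = (p_true - p_null) / se_true
    return (1 - norm_cdf(z_crit * se_null / se_true - shift)
            + norm_cdf(-z_crit * se_null / se_true - shift))

for n in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: power = {approx_power(0.51, n):.3f}")
# n = 100:       ~0.05
# n = 10 000:    ~0.52
# n = 1 000 000: ~1.00
```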

And that brings me to my final point: the null hypothesis is rarely really true in nature. Take two coins, for instance, each of which is almost exactly fair. Our null hypothesis is: these two coins have exactly the same probability of coming up heads. How likely is that to really be the case? The answer is that it is basically impossible. If we toss them often enough, we are bound to find that one of them comes up heads *slightly* more often than the other, and we will be able to say with confidence “coin 1’s probability of heads is higher by 0.0000000000000000001%” (for example). But does that really *matter* in practice? Probably not.

In other words, what also matters a lot is the *size* of the statistical effect. This has great practical relevance, for instance in medicine. If you collect data from enough people, you might be able to prove that the effect of your vaccine (say) is more than 0. But that’s not enough for that vaccine to be useful – for that, it has to be considerably more than 0.

Now in medicine, people are usually aware of this. But in other fields, and in science reporting, they often aren’t. For instance, you might read a headline that says “men are better at Scrabble than women”. The stats seem to prove it: if men were no better than women, a difference as large as the one observed would only occur about 1 time in a billion. You can write a nice sensationalized article around that. But *how much* better are men really? How many more games do they win? Is it 1 in 10? 1 in 100? 1 in a million? That information is often left out, and the conclusion is exaggerated into something totally binary, which it isn’t.
