When you compare two groups of people, say one getting the real medicine and one getting a placebo, you can never get truly identical groups to start with. Even if both groups got placebos, your measurements would almost always show one group doing slightly better than the other. If the medicine you are testing is ineffective, it might as well be a placebo, so the same thing applies.
So we use math to quantify how likely it is that the difference is a coincidence. It’s not enough to say “Group A did 5% better than group B”. If the spread of scores within each group is 30% wide, then a difference of 5% isn’t very convincing. On the other hand, if the spread within each group is only 2% wide, then 5% sounds a lot better.
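To make the “difference versus spread” idea concrete, here is a minimal sketch. It assumes the “spread” is a standard deviation shared by both groups and that each group has 20 people; the function name and all the numbers are made up for illustration. It expresses the 5-point gap in units of its own uncertainty (the standard error).

```python
import math


def standardized_difference(mean_a, mean_b, spread, n_per_group):
    """Gap between two group means, measured in standard errors.

    Assumption: `spread` is the standard deviation within each group,
    and both groups have the same size.
    """
    standard_error = spread * math.sqrt(2.0 / n_per_group)
    return (mean_a - mean_b) / standard_error


# Same 5-point gap, two hypothetical spreads.
wide = standardized_difference(55.0, 50.0, spread=30.0, n_per_group=20)
narrow = standardized_difference(55.0, 50.0, spread=2.0, n_per_group=20)

print(f"spread 30: gap is {wide:.2f} standard errors")
print(f"spread 2:  gap is {narrow:.2f} standard errors")
```

With the wide spread the gap is well under one standard error, which is the kind of wobble you expect from luck alone. With the narrow spread the same gap is several standard errors, which is much harder to explain away.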
The P value is the probability of seeing a difference at least as big as yours if you really were in the 2-placebo scenario and the difference was just luck. It is calculated from the ideas above, though with much more mathematical rigor. If the number is very small, by convention less than 0.05 (5%) and ideally even less than 0.01 (1%), then you may conclude that the medicine probably had a real effect. If your P value is much larger, like 0.3 (30%), then you’re not really convincing anyone the medicine was the deciding factor.
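One way to see where such a number can come from is a permutation test, sketched below. This is only one of several ways to get a P value, not necessarily what a given study used. The idea: if both groups were really on placebo, the group labels would be meaningless, so we shuffle the labels many times and count how often luck alone produces a gap at least as big as the observed one. The scores and the function name are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided P value by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the labels were handed out at random
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical improvement scores, made up for this example.
drug = [12, 15, 14, 16, 13, 17, 15, 14]
placebo = [10, 11, 9, 12, 10, 11, 13, 10]

p_value = permutation_p_value(drug, placebo)
print(f"estimated P value: {p_value:.4f}")
```

A small result here means that random relabeling almost never reproduces a gap this large, which is what makes the “just luck” explanation hard to believe.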
Hypothesis testing and p-values rest on a simple idea at their core: if we assume X is true, then what is the probability of observing something like Y?
The assumed “X” is the null hypothesis. You can think of it like the “default” state; researchers will often use a null like “no relationship exists between A and B” (although there are many other kinds of null hypotheses besides this). The “probability of observing something like Y” is the p-value.
Now, if the probability of observing something like Y is very low after assuming X, then we generally infer that X is “unlikely” to be true. After all, if X were true, then Y probably wouldn’t have happened. This is called “rejecting the null hypothesis”, where we state with some reasonable confidence that X isn’t true. Of course, we could be wrong about this: X might be true and we just got an unusual sample. However, the *brilliance* of hypothesis testing is that we control *exactly how often* this situation occurs. There is a bit of math involved here, but suffice it to say this: by rejecting the null only when the p-value is below 5%, we make this mistake in **at most** 5% of the experiments where the null is actually true (in the long run). This bolsters confidence in our conclusions, since we know there is a guaranteed cap on this error rate (otherwise, how would we be able to trust results from a process without guarantees?).
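The long-run guarantee can be checked with a small simulation. The sketch below runs many experiments in which the null is true by construction (both groups are drawn from the same distribution) and counts how often a 5% cutoff rejects it anyway. It assumes a two-sided z-test with a known standard deviation to keep the math short; the function names, group sizes, and seed are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(group_a, group_b, known_sd):
    """Two-sided p-value for a difference in means, assuming a known sd."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    standard_error = known_sd * math.sqrt(1.0 / n_a + 1.0 / n_b)
    z = diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=5000, n_per_group=20, alpha=0.05, seed=1):
    """Fraction of null-is-true experiments that get rejected anyway."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: no real effect.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p_value(a, b, known_sd=1.0) < alpha:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate()
print(f"share of true nulls rejected at the 5% cutoff: {rate:.3f}")
```

The printed share should land close to 0.05, which is the type of mistake the cutoff is designed to cap.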
Some notes:
– There is another kind of error, where we don’t reject the null hypothesis when we should. The current hypothesis testing paradigm doesn’t explicitly provide guarantees on this type of error, although it can be measured and quantified. For reference, these are called “type 2 errors”, whereas mistakenly rejecting the null is called a “type 1 error”.
– When we calculate p-values, those calculations *assume that the null hypothesis is true*. This is why p-values cannot be interpreted as “the probability that the null hypothesis is true”: the p-value already assumed that was the case. Remember, the p-value is only measuring “the probability of observing something like Y (assuming that X is true)”. It does not make any commentary on the likelihood of other realities besides X, nor does it discuss the likelihood of Y under those alternate realities.
Null hypotheses, Type I and II errors: these are beyond a 5-year-old, I think. So:
You want to compare two things: drug vs. placebo, radiation vs. chemotherapy, standard vs. underhand foul shooting, etc. You run an experiment and find a difference. Problem is, the difference could be real or just due to random factors. You ask a statistician, and they tell you “The difference is statistically significant, p < .05.” This means that if random factors alone were at work, a difference this big would show up less than 5% of the time. You now have evidence (not proof, just evidence) that your treatment worked.