Can someone please help me understand NHST?

I study psychology and would like to have a good basic knowledge of the relevant statistics. Thank you.

This is critical because you need to be able to accurately answer the question, “What would happen if my test drug did nothing at all?”

If you don’t know that answer, then all your testing that assumes the drug works will just affirm your beliefs, and you will not view the results with a properly neutral mindset.

You have an idea that for whatever reason men take longer to do some cognitive task than women do. You run the experiment on 37 people and find that on average men take 1.2 seconds longer than women do.

Null hypothesis testing is asking “IF there were really no difference between how long it takes men and women to do this task, how likely would it be to get a sample of 37 with a difference of 1.2 seconds or more?” The answer to that question is your p-value.
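One way to see that question concretely is a permutation test: shuffle the "men"/"women" labels many times and count how often the shuffled groups differ by at least as much as the real ones. This is only a minimal sketch. The task times below are randomly generated stand-ins (the 18/19 split, the means, and the spread are all assumptions, not data from any study), `permutation_p_value` is a hypothetical helper name, and it compares the absolute difference in means, so a gap in either direction counts as "extreme".

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often label shuffling gives a gap at least as big as the real one."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        shuffled_gap = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if shuffled_gap >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Made-up task times in seconds for 37 people (assumed values, for illustration only).
data_rng = random.Random(42)
men = [data_rng.gauss(11.2, 2.0) for _ in range(18)]
women = [data_rng.gauss(10.0, 2.0) for _ in range(19)]

p_value = permutation_p_value(men, women)
print(f"Estimated p-value: {p_value:.3f}")
```

The printed number is the share of shuffles that produced a gap at least as large as the observed one, which is the "if there were really no difference" probability described above.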

Yes, it’s a weird and backwards way of thinking about it. We do it this way because null hypothesis testing converts a difficult Bayesian problem (How likely is it these results reflect a true difference and not just statistical noise?) into a much simpler sampling problem with dramatically easier math. Computers nowadays are fast enough to just do the Bayesian problem if you’d rather, but the process is still complicated. Just in different ways.

You can’t definitively prove most things in science. So rather than proving a hypothesis to be absolutely true, you seek to show that it is more likely to be true than not.

For example, no one has ever proven that smoking absolutely causes cancer; there is always a very slight chance it’s all just a crazy coincidence. However, you can show that it is far more likely that smoking causes cancer than that smoking has no effect on cancer. Each study showing smokers with higher rates of cancer than non-smokers, when trying to control for other factors, makes the null less likely and strengthens the alternative hypothesis that smoking causes cancer. Apply this to anything and think of every experiment as an argument made against the null in order to increase support for your alternative hypothesis.

This is also an interesting case historically as the original studies linking smoking and cancer were some of the first truly modern scientific studies using stats. There has actually never been a proper randomized controlled trial on smoking, in part because it would be unethical as we are so certain it causes harm.

Often, we will want to test what the difference is between two sets of data.

Whether it’s testing crop growth in two different fields, differences in height between males and females, or whether or not a trial lubricant gives better machine performance.

There will be something you’re measuring. E.g. for crop growth it might be crop height.

There will be something different between two sample sets. In this case, it’s the field that the crops are growing in.

You want to tell if there’s a difference in crop height between the two fields. The “null hypothesis” would be *there is no difference in height between the two fields.*

The null hypothesis is the boring assumption. It’s “nothing interesting is going on here”.

The alternative hypothesis is the interesting one. It’s the one that says “Something is not boring here”.

To set your hypotheses, you decide “What am I looking to find out?” *Crops are a different height in Field A vs Field B*.

The null hypothesis then is the boring answer that you need to default to if you can’t prove anything interesting. *The crops are the same height in Fields A and B.*

We can then use canned statistical tests like “Student’s t-test”. We go measure the heights of a random sample of crops from both fields, work out the average, standard deviation, and number of samples for each, and then use those to calculate a t-score and eventually a p-value.

The p-value will be a number between 0 and 100%. It’s the *probability of getting results at least this extreme by random chance alone, assuming the null hypothesis is true*.

E.g. Imagine you get an average height from field A of 2.05 meters, and an average from B of 1.86 m (difference of 0.19 m).

Using the standard deviation, average, and number of samples, you calculate your t-score and get a p-value of 0.03. This indicates that if the fields were really the same, there would be only a 3% chance that your samples just *happened to come out this different*. That is not the same as a 97% chance that there’s a real difference between the fields, but it is low enough that you’d likely reject the null hypothesis and say “There is statistical evidence that the crops in Field A are taller than in Field B (by an estimated 0.19 m)”, presumably followed by “let’s make sure we do what Field A is doing in the future.”
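To make the "average, standard deviation, and number of samples" step less abstract, here is a rough sketch of the t-score arithmetic. Only the two averages come from the example; the standard deviations (0.40 m) and sample sizes (40 per field) are assumptions made up for illustration, so the output will not necessarily match the 0.03 quoted above. The function names are hypothetical, and the p-value uses a normal-curve shortcut rather than the real t distribution, which a stats library would handle properly and which gives a slightly larger p-value for small samples.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's two-sample t statistic (equal-variance form) from summary numbers."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error


def approx_two_sided_p(t_stat):
    """Large-sample shortcut: use the normal curve in place of the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


# Means are from the example; the standard deviations and sample sizes are assumed.
t_stat = pooled_t_statistic(2.05, 0.40, 40, 1.86, 0.40, 40)
p_approx = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.2f}, approximate p = {p_approx:.3f}")
```

Try making the assumed standard deviation bigger or the sample size smaller: the same 0.19 m gap gives a smaller t-score and a larger p-value.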

Another way to think of the p-value is “If the crops in fields A and B were actually the same height, there’s a 3% chance that I would happen to grab samples where one set came out at least this much taller than the other.”

A way to remember what the p-value means is “If the p is low, the null must go”. It rhymes, you see.

If the p-value were higher, random chance would remain a plausible enough explanation. A typical threshold is 5% or 0.05. So if your p-value came back at 6%, a difference like yours would turn up too often from random variation alone for you to rule that out. If you still think there might be a difference, you should repeat the test with a higher sample count and perhaps better measurement methods.

Sometimes the p-value comes back much higher, at 80% or more, indicating “If there were no real difference, you’d see a gap at least this big 80% of the time from random chance alone”. In that case, you *fail to reject the null hypothesis*, and all you can conclude is that this data gives you no reason to believe there’s a difference in crop height.

Things can happen randomly or for reasons outside of the reasons you are studying.

NHST is where you say “what if the thing I’m studying doesn’t have an effect, so these results I’m seeing are either normal or are from luck?” and you try to check how likely results like yours would be if the thing you want to test did nothing.

We do it this way because we *don’t know the effect of the thing we are testing* (yet) so it’s hard to prove what our results would look like if it *does* have an effect. How strong do we expect the result to be? We don’t know. But we do know how luck behaves and we do know what “normal” looks like (or we can test to find out) so it’s mathematically much more practical to check if your result looks like normal/lucky or truly extraordinary/significant.

For example, you test a new drug and 30% of patients get better. Without the drug, normally 25% of patients get better. So at first glance this looks good. But not *exactly* 25% of patients get better every time you test this. For a small group of people it often ranges from 15%-35% just because of random chance. You can’t say for sure the drug did anything, because you would expect a result like this even without the drug.

On the other hand maybe you test a larger group. Tested in this way, the 25% “normal” result gets more consistent. With a large group perhaps it is only normal for the result to vary from 22%-28% of patients improving. If you still see 30% of patients recovering after they take the new drug, that suggests it really did something as this is not a result we’d expect to see without the drug due to random chance. Of course, testing with a larger group doesn’t guarantee your drug is going to work, maybe you test it again in these circumstances and now only 25% of the patients taking the drug recover. Your larger test is more accurate and it’s easier to confirm if the result is random or a real effect, but in that last example, you’d say this looks random/normal and it does not show the drug did anything.
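Here is a small sketch of that small-group versus large-group comparison. It asks: if the drug did nothing and each patient still had a 25% chance of recovering, how often would 30% or more recover purely by luck? The group sizes of 80 and 800 are assumptions (the example only says "small" and "large"), picked so the typical spread is roughly in line with the ranges mentioned above, and `binomial_tail` is a hypothetical helper name.

```python
from math import exp, lgamma, log


def binomial_tail(n, k_min, p):
    """P(X >= k_min) when X counts successes in n independent tries with chance p."""
    total = 0.0
    for k in range(k_min, n + 1):
        # Work in logs so large factorials do not overflow.
        log_pmf = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p)
        )
        total += exp(log_pmf)
    return total


BASELINE_RATE = 0.25   # recovery rate without the drug (from the example)
OBSERVED_RATE = 0.30   # recovery rate seen with the drug (from the example)

# Group sizes are assumptions; the example only says "small" and "large".
results = {}
for n_patients in (80, 800):
    recovered = round(OBSERVED_RATE * n_patients)
    results[n_patients] = binomial_tail(n_patients, recovered, BASELINE_RATE)
    print(f"n = {n_patients}: P(at least {recovered} recover by luck) = {results[n_patients]:.4f}")
```

With the small group, luck alone produces a 30% recovery rate fairly often, so you can’t rule it out. With the large group the same 30% becomes very unlikely under "the drug does nothing", which is why the bigger test is more convincing.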