I do not understand why it’s harder to find a significant difference in data when you do more comparisons

I am a grad student desperately trying to analyze her data. I am having a hard time understanding why correcting for the number of tests I’m doing (Bonferroni and Tukey) is taking away my significance. I have 4 factors across 3 timepoints, and when I run stats on each factor across the timepoints, they are significant. When I put them all together on one graph (all four factors across all 3 timepoints), they are no longer significant. I understand how Bonferroni works; what I am asking is why it feels like I am being punished with stricter p-values when I am being more thorough. I feel like this correction encourages people to break down their data in order to get significance, which feels icky. I’m wishing I had studied just one of the factors across the timepoints instead of all 4.

4 Answers

Anonymous 0 Comments

Corrections for multiple hypothesis testing are meant to combat the fact that, no matter what significance level you set, if you perform enough statistical tests, you’re all but guaranteed to find at least one “significant” result by chance alone. While this, in some sense, punishes thoroughness, it also punishes running a bunch of unmotivated tests and cherry-picking the ones that come up significant at random.
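To put a rough number on that, here is a minimal Python sketch of the chance of at least one false positive across several tests. It assumes the tests are independent and that every null hypothesis is actually true, which is a simplification; the function name is just made up for illustration.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes independent tests and that every null hypothesis is true.
    """
    # Each test avoids a false positive with probability (1 - alpha),
    # so all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 12, 100):
        rate = familywise_error_rate(0.05, m)
        print(f"{m:>3} tests at alpha = 0.05: P(at least one false positive) = {rate:.3f}")
```

Under those assumptions, 12 uncorrected tests at alpha = 0.05 already carry close to a coin-flip chance of producing at least one spurious “significant” result.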

If I’m reading your comment correctly, it sounds like you performed 12 tests, all of which came back significant, but perhaps you don’t have a ton of statistical power, so a standard Bonferroni correction (which just divides alpha by 12 in this case) kills all of them. This is a case where you could likely just present the 12 results and be fine. To make it extra-clear you’re not cherry-picking results, you could explain how each test is important and logical to do, and how you did not do any other tests. (If it’s not true that each test needed to be done, then you could consider just picking one to highlight, though it would have been better to register this intention *before* doing your tests.)
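As a rough illustration of how that plays out, the sketch below applies a Bonferroni threshold to 12 p-values. The p-values are invented for the example (they are not the poster’s data), and the function name is hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one True/False per test: True if it survives Bonferroni."""
    # Bonferroni compares every p-value to alpha divided by the number of tests.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Made-up p-values: every one is below 0.05 on its own.
    p_values = [0.010, 0.012, 0.015, 0.018, 0.020, 0.022,
                0.025, 0.030, 0.032, 0.035, 0.040, 0.045]

    uncorrected = [p <= 0.05 for p in p_values]
    corrected = bonferroni_reject(p_values, alpha=0.05)

    print("Bonferroni threshold:", 0.05 / len(p_values))
    print("Significant without correction:", sum(uncorrected))
    print("Significant after Bonferroni:", sum(corrected))
```

With 12 tests the per-test threshold drops to about 0.0042, so p-values in the 0.01 to 0.045 range all pass on their own and all fail once corrected.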

You also have statistical recourse. You could do a joint test of all the hypotheses, which should be enough to establish that at least one rejects the null. There are also certainly multiple testing corrections that are less than 80 years old that could better account for the fact that you’re getting multiple significant results or are better tuned to low-power environments.
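The answer doesn’t name a specific newer correction. One common example is Holm’s step-down procedure, picked here purely as an illustration: it controls the same family-wise error rate as Bonferroni but is never more conservative. The sketch below uses made-up p-values and hypothetical function names.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test p-values from smallest to largest.

    The k-th smallest p-value (k = 0, 1, ...) is compared to alpha / (m - k).
    The procedure stops at the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject


if __name__ == "__main__":
    # Made-up p-values for four tests.
    p_values = [0.001, 0.004, 0.015, 0.04]
    print("Bonferroni:", bonferroni_reject(p_values))
    print("Holm:      ", holm_reject(p_values))
```

On these invented numbers Bonferroni keeps only the two smallest p-values, while Holm keeps all four, because each rejection relaxes the threshold for the next test. Whether Holm, a false-discovery-rate method, or a joint test fits best depends on the study design.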
