I am a grad student desperately trying to analyze her data. I am having a hard time understanding why correcting for the number of tests I’m doing (Bonferroni and Tukey) is taking away my significance. I have 4 factors across 3 timepoints, and when I run stats on each factor across the timepoints, they are significant. When I put them all together on one graph (all four factors across all 3 timepoints), they are no longer significant. I understand how Bonferroni works; what I am asking is why it feels like I am being punished with stricter p-values when I am being more thorough. I feel like this correction encourages people to break down their data in order to get significance, which feels icky. I’m wishing I had studied just one of the factors across the timepoints instead of all 4.
Corrections for multiple hypothesis testing are meant to combat the fact that, no matter what significance level you set, if you perform enough statistical tests, you’re all but guaranteed to find at least one “significant” result by chance alone. While this, in some sense, punishes thoroughness, it also punishes running a bunch of unmotivated tests and cherry-picking the ones that come up significant at random.
If I’m reading your comment correctly, it sounds like you performed 12 tests, all of which came back significant, but perhaps you don’t have a ton of statistical power, so a standard Bonferroni correction (which just divides alpha by 12 in this case) kills all of them. This is a case where you could likely just present the 12 results and be fine. To make it extra clear you’re not cherry-picking results, you could explain how each test is important and logical to do, and that you did not do any other tests. (If it’s not true that each test needed to be done, then you could consider just picking one to highlight, though it would have been better to register this intention *before* doing your tests.)
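To put numbers on the "divides alpha by 12" part, here is a minimal Python sketch. The p-values are made up for illustration (they are not from your data), and `bonferroni_reject` is just a name chosen for this example.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one reject/keep decision per test under a Bonferroni correction."""
    m = len(p_values)
    threshold = alpha / m  # each test is held to alpha divided by the number of tests
    return [p <= threshold for p in p_values]


# Hypothetical p-values for 12 tests (4 factors x 3 timepoints), all below 0.05.
p_values = [0.010, 0.020, 0.030, 0.040, 0.012, 0.025,
            0.008, 0.045, 0.015, 0.035, 0.022, 0.048]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_reject(p_values, alpha=0.05)

print("per-test threshold:", 0.05 / len(p_values))
print("significant before correction:", sum(uncorrected))
print("significant after Bonferroni:", sum(corrected))
```

With these invented numbers every test clears 0.05 on its own, but none clears 0.05 / 12 ≈ 0.0042, which is the pattern you’re describing.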
You also have statistical recourse. You could do a joint test of all the hypotheses, which should be enough to establish that at least one of the nulls can be rejected. There are also certainly multiple testing corrections that are less than 80 years old that could better account for the fact that you’re getting multiple significant results or are better tuned to low-power environments.
The more distinct tests you do, the more likely you are to find patterns that aren’t there. You realized this instinctively when you said that breaking the data up to increase significance is icky.
A great example of why is this xkcd https://xkcd.com/882/
If you run 20 different tests at a 5% significance level, you would expect about one of them to come back significant by chance, even if no real pattern is there.
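The arithmetic behind that can be sketched in a few lines of Python. This assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function names are just for illustration.

```python
def expected_false_positives(m, alpha=0.05):
    """Expected number of false positives among m tests when every null is true."""
    return m * alpha


def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 12, 20):
    print(m,
          round(expected_false_positives(m), 2),
          round(familywise_error_rate(m), 3))
```

Under those assumptions, 20 tests give one expected false positive and roughly a 64% chance of at least one; 12 tests give roughly a 46% chance.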
It’s worth noting that Bonferroni is a *really* conservative (erring on the side of non-significance) correction. It’s an easy option but generally not a great one — there are other procedures, probably available in whatever stats software you’re using or just out there for you to look up, that can be perfectly appropriate to use.
Also, there are cases when it’s not really necessary to perform multiple testing corrections at all, or to perform your corrections in smaller groups… which you noted can be *icky* if it comes from a motivation of just p-fishing, but is much more defensible when you do it in a way that aligns with your actual hypotheses (for instance, correcting the timepoint-wise comparisons separately per factor). But opinions about that differ a lot and what is appropriate to one statistician may not be to another.
Bonferroni is too strict and has largely been superseded by Holm-Bonferroni anyway, which is less strict but still does a good job of holding the familywise error rate at alpha.
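For anyone curious how the step-down version differs, here is a rough pure-Python sketch of the Holm procedure next to plain Bonferroni. The four p-values are invented for illustration, and `holm_reject` is a hypothetical helper name, not a function from any particular stats package.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: one reject/keep decision per test, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p is compared to alpha/m, the next to alpha/(m-1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept as well
    return reject


# Hypothetical p-values, not real data.
p_values = [0.004, 0.030, 0.015, 0.020]

bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
holm = holm_reject(p_values)

print("Bonferroni rejects:", sum(bonferroni))
print("Holm rejects:", sum(holm))
```

With these made-up values Bonferroni keeps only the smallest p-value, while Holm rejects all four, because each later comparison uses a slightly looser threshold.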
Anyway, it depends on what exactly you were investigating with the multiple factors. E.g., in the extreme case, each is exactly identical and that would make correction pointless (because it’s really just one factor). So the fact that Bonferroni makes them not significant isn’t necessarily a big deal. I’d have larger questions about how separate the factors really are. If they’re numeric, are they highly correlated, like .8? Or are they truly unique? If they’re highly correlated, then Bonferroni overcorrects (meaning it isn’t buying you any extra protection against type I errors, it’s just unnecessarily making it harder to detect anything). There are methods above my head that take into account the correlation of the “independent” tests through bootstrapping.
If they’re unique, then as the other responder said, it’s fair to just say these factors are individually significant. Even if Bonferroni (or a better correction) eliminates the effect, that doesn’t mean it isn’t interesting and worth following up on. You maybe don’t have the power for four separate tests, but all of them were individually significant and that’s still interesting. Correction is just another tool for decision making. I’d be more skeptical of the results if just one of the four effects had been significant and it then didn’t survive correction.