what “statistical power” actually means, why one statistical test would be more powerful than another when applied to the same data, and when one might want to use a more or less powerful test.


I have a science background, but stats has always been a weak point for me. The tests I’m thinking of specifically are Fisher’s Exact Test, Barnard’s Test, and Boschloo’s test, but I’d like to understand the concept generally. Fisher’s is generally the go-to standard for my field, but from my understanding, Barnard’s is *sometimes* more powerful than Fisher’s, and Boschloo’s is *“uniformly”* more powerful than Fisher’s. To my not-understanding brain, that sounds like Boschloo’s should have long since made Fisher’s obsolete, so I’m looking for clarification on what “power” actually means, as well as why something like Barnard’s could be more powerful in some cases but not in others.


To quote [Wikipedia](https://en.wikipedia.org/wiki/Power_of_a_test):

> The power of a binary hypothesis test … is the probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true.

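The tests you mention are all for contingency tables, i.e. for checking whether there is a link between two categorical variables. Your null hypothesis would be that there is no link; your alternative hypothesis would be that there is some link. When you do the test you get a p-value, which is the probability of getting data at least as extreme as yours if the null hypothesis is true. If the p-value falls below a threshold chosen in advance (the significance level, often 0.05), you reject the null hypothesis and say there is a link.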

So there are four outcomes:

1. there is a link, and the test says there is a link,
2. there is a link but the test says there isn’t a link (Type II error / false negative),
3. there isn’t a link but the test says there is a link (Type I error / false positive),
4. there isn’t a link, and the test says there isn’t a link.

The power of a test tells you about outcome 1: it is the chance, if there *is* a link between the categories, that the test will detect it. Equivalently, power is 1 minus the chance of a Type II error (outcome 2).

This isn’t always the most important thing to care about, and there are different measures of how useful a statistical test is, but there will be situations where it matters.

So the idea of tests being more powerful is that a more powerful test will do a better job of correctly saying there is a link (or correctly rejecting a null hypothesis), while keeping the chance of a Type I error at or below the same level. Or to put it another way, a less powerful test is more likely to give you a Type II error (a false negative).
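To make that concrete, here is a minimal Python sketch that computes the power of a one-sided Fisher’s exact test exactly, by listing every possible 2×2 table for two independent groups. The group sizes (20 and 20), the true rates (0.7 vs 0.3), the 0.05 threshold and all function names are hypothetical illustration choices, not anything from the tests’ official definitions.

```python
from math import comb

def fisher_p_greater(x1, n1, x2, n2):
    """One-sided Fisher p-value: P(X1 >= x1) given the total number of successes."""
    m = x1 + x2  # total successes, treated as fixed (hypergeometric model)
    tail = sum(comb(n1, k) * comb(n2, m - k) for k in range(x1, min(n1, m) + 1))
    return tail / comb(n1 + n2, m)

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def rejection_probability(n1, n2, p1, p2, alpha=0.05):
    """Chance that the test rejects when the true success rates are p1 and p2."""
    return sum(
        binom_pmf(x1, n1, p1) * binom_pmf(x2, n2, p2)
        for x1 in range(n1 + 1)
        for x2 in range(n2 + 1)
        if fisher_p_greater(x1, n1, x2, n2) <= alpha
    )

# Assumed scenario: there really is a link (0.7 vs 0.3).
power = rejection_probability(20, 20, 0.7, 0.3)
# Assumed scenario: there is no link (both groups at 0.5).
false_positive_rate = rejection_probability(20, 20, 0.5, 0.5)

print(f"power (outcome 1):               {power:.3f}")
print(f"false positive rate (outcome 3): {false_positive_rate:.3f}")
```

`power` is the chance of outcome 1 for that one specific alternative, and `false_positive_rate` is the chance of outcome 3 when both groups share a rate of 0.5. Because the data are discrete, the latter generally comes out below 0.05, which is the sense in which Fisher’s test is often called conservative.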

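With Fisher’s and Barnard’s, which one is more powerful depends on how the data were collected and on the true underlying rates. From reading Wikipedia, Fisher’s test treats both the row totals and the column totals as fixed in advance (which gives a hypergeometric sampling distribution), and that is rarely how experiments are actually designed. When only one set of totals is fixed, for example the two group sizes, Barnard’s test tends to be more powerful, i.e. if you apply both tests to the same data you are less likely to get a Type II error from Barnard’s test. But in other situations Fisher’s test will be more powerful.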

As to why you would use a less powerful test: I haven’t studied any of these tests in detail, but I can think of a couple of possibilities. Firstly, power isn’t necessarily the most important factor; it might be more important to minimise Type I errors than Type II errors, or to focus on things like sensitivity and specificity. Secondly, it might be that the more powerful test is harder or more time-consuming to run.
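On the “harder to run” point, here is a rough sketch of the idea behind Boschloo’s test from the question, assuming two independent groups of fixed size and a one-sided alternative. It uses Fisher’s p-value as the test statistic and then maximises the null probability of “a Fisher p-value at least this small” over the unknown common success rate. The grid search over that rate is a crude approximation, and the names, group sizes (10 and 10) and rates (0.7 vs 0.3) are hypothetical; for real work a library routine such as `scipy.stats.boschloo_exact` would be the safer choice.

```python
from math import comb

def fisher_p_greater(x1, n1, x2, n2):
    """One-sided Fisher p-value: P(X1 >= x1) given the total number of successes."""
    m = x1 + x2
    tail = sum(comb(n1, k) * comb(n2, m - k) for k in range(x1, min(n1, m) + 1))
    return tail / comb(n1 + n2, m)

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def p_value_tables(n1, n2, grid=50):
    """Fisher and (grid-approximated) Boschloo p-values for every possible table."""
    tables = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)]
    fisher = {t: fisher_p_greater(t[0], n1, t[1], n2) for t in tables}
    # Candidate values for the unknown common success rate under the null.
    rates = [i / grid for i in range(1, grid)]
    null_prob = {
        t: [binom_pmf(t[0], n1, r) * binom_pmf(t[1], n2, r) for r in rates]
        for t in tables
    }
    boschloo = {}
    for t in tables:
        # Tables whose Fisher p-value is at least as small (tolerance for float ties).
        extreme = [u for u in tables if fisher[u] <= fisher[t] * (1 + 1e-9)]
        # Worst case over the candidate rates: the extra step Fisher's test skips.
        boschloo[t] = max(
            sum(null_prob[u][j] for u in extreme) for j in range(len(rates))
        )
    return fisher, boschloo

def rejection_probability(p_values, n1, n2, p1, p2, alpha=0.05):
    """Chance of rejecting when the true success rates are p1 and p2."""
    return sum(
        binom_pmf(a, n1, p1) * binom_pmf(b, n2, p2)
        for (a, b), p in p_values.items()
        if p <= alpha
    )

fisher, boschloo = p_value_tables(10, 10)
print("Fisher power:  ", round(rejection_probability(fisher, 10, 10, 0.7, 0.3), 3))
print("Boschloo power:", round(rejection_probability(boschloo, 10, 10, 0.7, 0.3), 3))
```

Every Boschloo p-value computed this way is no larger than the matching Fisher p-value, so at the same threshold Boschloo’s rejects whenever Fisher’s does, which is what “uniformly more powerful” refers to. The price is the extra maximisation step, which Fisher’s test does not need.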

There’s a popular metaphor going around:

You send a child down into the basement to look for a particular tool. The child comes back and reports the tool is not there. What are the odds that the child is correct?

If the basement is well lit and well organized, the tool is large, and the child looked for a long time, then the odds are pretty good that the child is right and the tool is not there.

If the basement is dark and cluttered, the tool is small, and the child only took a very quick look, the odds are pretty good that the child is wrong and the tool *is* down there.

When the basement is dark and cluttered and the tool is small, the child needs to look for a much longer time before you can conclude that the tool is not there.

In this metaphor, the tool is the correlation we’re looking for. The clutter is the amount of noise in the data. The size of the tool is how strong the correlation is, and the child’s statement that the tool is not there is the null hypothesis.

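The time spent looking is the amount of data collected. Power is the chance that the child finds the tool when it really is down there, so you want to collect enough data that a report of “not there” is believable. (Studies are commonly designed for around 80% power; the familiar 95% figure usually refers to the confidence level, i.e. a 5% chance of a false positive.)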
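The “time spent looking” and “size of the tool” parts can be put into numbers. Here is a minimal sketch, assuming two equal-sized groups, a one-sided Fisher’s exact test at the 0.05 level, and made-up true rates (0.6 vs 0.4 for a small tool, 0.8 vs 0.2 for a large one); the helper names are hypothetical.

```python
from math import comb

def fisher_p_greater(x1, n1, x2, n2):
    """One-sided Fisher p-value: P(X1 >= x1) given the total number of successes."""
    m = x1 + x2
    tail = sum(comb(n1, k) * comb(n2, m - k) for k in range(x1, min(n1, m) + 1))
    return tail / comb(n1 + n2, m)

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def fisher_power(n, p1, p2, alpha=0.05):
    """Exact power with n subjects per group and true success rates p1 and p2."""
    return sum(
        binom_pmf(a, n, p1) * binom_pmf(b, n, p2)
        for a in range(n + 1)
        for b in range(n + 1)
        if fisher_p_greater(a, n, b, n) <= alpha
    )

# Group size plays the role of time spent looking; the gap between the
# two rates plays the role of the size of the tool.
powers = {}
for n in (5, 10, 20, 40):
    powers[n] = {
        "small": fisher_power(n, 0.6, 0.4),
        "large": fisher_power(n, 0.8, 0.2),
    }
    print(f"n per group = {n:2d}  small effect: {powers[n]['small']:.3f}  "
          f"large effect: {powers[n]['large']:.3f}")
```

Power should climb as the groups get bigger, and should be higher for the large effect at any given group size, which is the child needing much longer to find a small tool.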

(I think I got part of that backward, but I don’t know which part.)