# How do statistical tests prove significance?


I did a biology undergraduate degree and often did reports where we would statistically analyse our results. A p-value of less than 0.05 shows that the results are statistically significant. How do these tests actually know the data is significant? For example, we might look at correlation and get a significant positive correlation between two variables. Given that the variables can be literally anything, how does doing a few statistical calculations determine it is significant? I always thought there must be more nuance, as the actual variables can be so many different things. It might show me a significant relationship for two sociological variables and also for two mathematical ones, when those variables are so different?

In: Mathematics

The basic idea is that when doing an experiment, you get a certain value with a certain probability, which can be described by a probability distribution. Most often, it’s assumed that your values follow the normal distribution, aka the bell curve.

Then you fit such a curve to each of your observation sequences (e.g. patients who received a drug, and those who received a placebo), and look at how much the two bell curves overlap. The less the overlap, the lower the probability that the difference is just random chance.
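The same “could this difference be chance?” question can be asked directly with a permutation test (a different technique from fitting bell curves, but answering the same question): pool both groups and see how often a random relabelling produces a gap as large as the real one. A minimal sketch, using entirely made-up scores for two hypothetical groups:

```python
import random

# Hypothetical outcome scores for a drug group and a placebo group.
drug    = [7.1, 6.8, 7.4, 7.9, 6.5, 7.2, 7.7, 6.9]
placebo = [6.2, 6.6, 5.9, 6.4, 6.8, 6.1, 6.3, 6.0]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(drug) - mean(placebo)  # = 0.9

# If the drug did nothing, the group labels are arbitrary, so shuffle the
# pooled data many times and count how often a difference at least this
# large appears by chance alone.
random.seed(0)
pooled = drug + placebo
n_drug = len(drug)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[:n_drug]) - mean(pooled[n_drug:]) >= observed_diff:
        count += 1

p_value = count / trials
print(f"observed difference: {observed_diff:.2f}, p ≈ {p_value:.4f}")
```

Because the two groups barely overlap, almost no random relabelling matches the real gap, so the estimated p-value comes out very small.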

“Significance” in this context doesn’t mean “this is true”; it means “the chance this is true is pretty damn high”. Generally speaking, the stronger the correlation, the more likely it is to be real. The p-value is essentially a measure of how likely it is that the results you got were just a fluke – that there’s no pattern at all and the data just happened to come out looking like there was. The tests that determine the p-value look at how far the data deviate from what you’d expect if nothing were going on, relative to the noise. The larger that deviation is relative to the noise, the lower the p-value, because it’s very unusual for random chance alone to produce such a large, consistent deviation. It could still happen, which is why the p-value isn’t 0; all you’re doing is saying “the chance that random chance produced *these* results is sufficiently low that we can decide the correlation is significant and therefore reproducible”.

Also, there are cases where a p-value of 0.05 is still too high to be confident the correlation is actually there. In some fields, the results won’t be considered significant until the p-value is below 0.01, or even lower.

When we find a relation with a small p-value, we are essentially saying there is only a small chance we would see such a relation by random chance alone. This then allows us to accept a hypothesis at a certain level of confidence.

When using p-values and other statistical methods, you are looking to either accept or reject a hypothesis that you have created. Frequently, we use null and alternative hypotheses for simplicity. When creating a testable hypothesis, it needs to pass the sniff test. Historically, butter production in various countries has had a statistically significant relation to the returns of the S&P 500. This doesn’t mean that the relation is necessarily real, though.

With the ridiculous number of possible relations in our data-rich world, there will be significant relations between variables that make no sense. The probability of getting 10 heads in a row is incredibly small (about 1 in 1,024), but in a set of 100,000 flips it’s actually fairly likely to happen somewhere. The way to get around this is either using common sense in data and relation selection, or finding the same significant relation in comparable, independent data.

The correlation between worldwide non-commercial space launches and sociology doctorates per year is high, but does it really mean anything? Maybe space launches correlate with science funding, and total doctorates also increase with global science funding? Maybe the US has the majority of space launches, and so it makes more sense there. A high correlation does not imply truth.

https://blog.psyquation.com/es/correlation-with-a-twist/

A statistical test cannot “prove” significance. In fact, you cannot prove statistical significance at all; you can only measure it. There are many techniques to measure it, and they usually give you a couple of numbers. Usually the most important number is the p-value (others include effect sizes).

The p in p-value stands for probability. It measures how likely you would be to get results this good by pure luck, if nothing were actually happening.

For example, let’s say I claim to be a psychic who can control chaos magic like Wanda and determine the result of a coin toss. How many heads in a row is enough to convince you that I am a psychic?

If I throw 2 heads in a row, you might just call me lucky. If I throw 5 heads in a row, you might think that I’m up to something. If I throw 20 heads in a row, I will definitely get your interest. Either I have an excellent throwing technique, or there’s a trick in the coin, or I’m a real psychic, but you would be pretty sure it’s not down to chance.

So maybe for you the limit is somewhere between 5 and 20 coin tosses. If you do the statistical test, the p-value for 5 heads in a row is 0.03125, while for 20 heads it is about 9.5e-7.
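The arithmetic behind those two numbers is just repeated halving, a quick check:

```python
# Chance of k heads in a row from a fair coin is 0.5 ** k.
p5 = 0.5 ** 5    # 0.03125
p20 = 0.5 ** 20  # = 1/1048576 ≈ 9.5e-7
print(p5, p20)
```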

Now, the same with biology. Let’s say we are testing whether a medicine is working. How do we know if a medicine is working, or whether it’s just luck?

*************************************

Well, you want to find out the p-value. To do that, you use one of the many statistical tests. These are tools that people can misuse and abuse; in fact, it is quite hard to get right.

And then you get a p-value. Different fields have different standards. It seems that you are familiar with p < 0.05, which is a 1 in 20 chance that it’s luck. Other fields require 5 sigmas, which translates to roughly a 1 in 3.5 million chance.

https://news.mit.edu/2012/explained-sigma-0209

Basically you come up with the “null hypothesis” (the assumption that nothing special is going on) and ask “how likely is this result to happen by chance alone?”

So let’s say I claim I can toss a coin to land however I want, most of the time. How do you test this?

Say I toss 4 heads and 1 tails while trying to make it always heads. The default is that it’s 50:50 and I have no effect. So how likely is it to toss at least 4 heads in 5 tosses? 0.1875

So we’d say that wasn’t significant at the 5% level. Since it would happen 19% of the time by chance!

Now if I tossed 11 heads and 1 tails, the probability would be about 0.00317, which is roughly 0.3% and so below the 5% threshold commonly used for significance. (Although that’s quite arbitrary; you can choose any threshold before you start.)
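Both tail probabilities can be checked with the binomial formula: count the equally likely toss sequences with at least k heads and divide by the 2^n total sequences. A quick sketch:

```python
from math import comb

def p_at_least(k_heads, n_tosses):
    """Chance of at least k heads in n tosses of a fair coin."""
    favourable = sum(comb(n_tosses, i) for i in range(k_heads, n_tosses + 1))
    return favourable / 2 ** n_tosses

print(p_at_least(4, 5))    # 0.1875
print(p_at_least(11, 12))  # = 13/4096 ≈ 0.00317
```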

What this calculation doesn’t do is tell you how effectively I can control it being a head. Just that I can deviate from the normal result enough that I can produce otherwise unlikely events.

> For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

You’re doing it wrong.

You start with a null hypothesis. This is the thing you want to show is false.

For example, you think that people who eat more apples also eat more pears. So your null hypothesis is that people eat the same number of pears regardless of how many apples they eat (no correlation). Then you go get data and test it.

But people also eat plums, and that might affect whether they eat pears! So you include plum eating in the formula.

If you find a correlation between eating pears and eating plums with a P value of less than 0.05, is that statistically significant?

No, it is not. Why? Because that is only your hypothesis BECAUSE you found a correlation, which biases your results. You might have had 20 different fruits you were checking, which would mean the odds are that at least one would show a correlation, even if there were no real correlations at all.
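The rough arithmetic behind that: if each of 20 independent comparisons has a 5% false-positive rate, the chance that at least one of them looks “significant” by pure luck is substantial:

```python
# Each comparison alone has a 5% chance of a false positive; across 20
# independent comparisons, the chance of at least one spurious "hit" is:
p_any_false_positive = 1 - 0.95 ** 20
print(f"{p_any_false_positive:.2f}")  # ≈ 0.64
```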

Doing random p-testing on things you have no reason to think are correlated is simply wrong. You might well do that at a preliminary stage to find out what hypothesis to test in the first place, but the data you use to come up with the hypothesis cannot be the same data you use to test it.

To see why, imagine a man who sees a coin tossed 4 times. Each time it comes up heads. He thinks: maybe the coin is biased. He then tests that using his observations of the coin coming up heads 4 times, and, lo and behold, the data backs it up – this coin is not a fair coin! That’s crazy, right?

> Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

Basically, P values tell you, “what are the odds we’d get this if our null hypothesis was true?”. If it’s 0.05, that suggests but does not prove that your null hypothesis is false. Go do more testing.

>For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

I think you’ve been thrown by one of the (many) confusing things about statistical significance. It took a while for this to click with me.

A test of significance has *nothing* to do with what you’re actually measuring. Figuring out the relationship between two variables is about picking the right test of correlation (and ensuring the logic of such a relationship).

Significance testing is about *sampling*. (Which in this case could also be repeating an experiment.)

Imagine I have two bags filled with loads of red and blue balls. I want to know if the proportions are different between the two bags, so I pull out a random sample from each bag.

Now imagine I have two big groups of men and women. I want to know if the proportions are the same between the two groups, so I draw a random sample from each.

I’m looking at completely different things, but the statistics of *sampling* are the same.

Now, actually doing significance testing right is rather more complex than that. To start with, you need to decide what level of significance you need. Mathematics can’t tell you that; it’s a question of “what level of risk am I willing to take that my results are down to chance?” A 95% confidence level is just a convention, and as another commenter has said, the convention is not always 95%.

Also, significance tests assume the null hypothesis is true. This leads to a problem like screening tests: if most people don’t have a disease, then you’ll get a lot more false positives than if most people do have it. In a similar way, if a hypothesis is unlikely to be true, you’re more likely to get a false positive than if it’s likely to be true. But factoring that in means making a guess at how likely your hypothesis is to be true, which takes us into the contentious world of Bayesian statistics.

“The problem is that there is near unanimity among statisticians that p values don’t tell you what you need to know but statisticians themselves haven’t been able to agree on a better way of doing things.”

That last bit is probably a bit complex for ELI5… I might be able to explain it better if anyone wants. (Or [here](http://www.dcscience.net/2020/10/18/why-p-values-cant-tell-you-what-you-need-to-know-and-what-to-do-about-it/) is a more technical explanation for anyone who wants that.)

Short ELI5 answer: the p-value is “assuming our hypothesis is wrong, what are the odds that we got a result at least this extreme by chance?”

It’s not proving or disproving anything, including a relationship between two variables. All it’s doing is saying that, assuming our hypothesis is wrong (aka the null hypothesis/status quo is ‘true’), you would see a result at least as extreme as ours (p-value × 100) percent of the time.

What you’re getting at is that “significance” doesn’t really mean “significance.”

A better term for “statistical significance” is “statistical discernibility.” You measure some X and Y 100 times and find a correlation of 0.42 with some standard error. Then you ask, “Okay, if the true correlation between X and Y were zero, how hard would it be to draw a sample of 100 with a sample correlation of 0.42 or more extreme?” The answer to that is your p-value. If the p-value is low, you’re saying “We can’t ever know the exact true correlation, but we can be very confident that it isn’t zero.” You’re saying your result can be discerned or distinguished from zero.

But, statistical significance doesn’t mean that it substantively matters. That’s a matter for effect size and the confidence interval around it. Suppose you’re researching the effects of eating blueberries on human longevity, and find a statistically discernible effect. If that effect is “You would have to eat the entire mass of the earth in blueberries every year to extend your life by one month,” it doesn’t really matter even if the p-value is 0.0000000001.

Statistical significance also doesn’t mean causality; the usual examples here are Tyler Vigen’s spurious correlations. X and Y can go together because:

* X causes Y
* Y causes X
* They both cause each other simultaneously
* There’s some Z that causes X and Y
* Other stuff I’m forgetting
* For literally no reason at all

Figuring out causality is, mostly, a research design question and not a statistics question. There are circumstances where causality is relatively straightforward statistically, but you have to be able to perform true experiments or have to luck into the right kind of data being available.

When you can’t do a true experiment and you don’t have that lucky kind of data, what you mostly do is ask “What would the world look like if I were right? What would it look like if I were wrong?” If you’re right, more X goes with more Y, and more A goes with less B, and so on. What you’d like to do here is have a whole set of things to look at, some of which are weird or surprising. You can see a lot of this in early covid and other epidemiological studies — if it’s spread by air, then we should see these relationships between variables, but if it’s spread by droplets, we should see other relationships that we wouldn’t see if it were airborne, and if it’s spread by contaminated water we should see yet other relationships.

This book [https://www.amazon.com/How-Not-Be-Wrong-Mathematical/dp/0143127535](https://www.amazon.com/How-Not-Be-Wrong-Mathematical/dp/0143127535) has a very good explanation of the history and interpretations of “significance.”

Generally, a statistical test can show that the probability that something happens under a certain assumption is very low (often, lower than 0.05 or 5% is used as a threshold). This gives evidence that the assumption is wrong. For example, I can flip a coin and make the assumption that the coin is fair (heads and tails equally likely). If I flip it and it comes up heads 19 times out of 20, I can reason “if this coin were fair, then the chance of getting at least 19 heads out of 20 would be extremely small, so this coin is probably not fair”.
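For the record, the coin example works out to a tiny probability under the fairness assumption:

```python
from math import comb

# Chance of at least 19 heads in 20 tosses of a fair coin.
p = sum(comb(20, k) for k in (19, 20)) / 2 ** 20
print(p)  # = 21/1048576 ≈ 0.00002, far below any common threshold
```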

We can do the same thing for correlation. If we have a bunch of observations and each observation contains two variables (say, we measure the height and weight of a bunch of people), we can make the “assumption” that height and weight are not related to each other, aka, independent. Under this assumption, it is highly likely that the correlation coefficient will be quite close to 0. If we compute the correlation and find that it’s something much larger than 0, we know it is very unlikely for a bunch of unrelated random numbers to have a big correlation (by pure chance) so we conclude that our assumption was probably wrong. The statistical test gives us a way of quantifying exactly how unlikely this was.
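One way to quantify “how unlikely” without any distributional formulas is a permutation test: shuffle one variable so any real link is destroyed, and see how often a correlation as strong as the observed one appears anyway. A sketch with made-up height/weight numbers for ten hypothetical people:

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical heights (cm) and weights (kg).
height = [160, 165, 170, 172, 175, 178, 180, 183, 186, 190]
weight = [55, 62, 66, 64, 70, 74, 72, 80, 84, 88]

r_obs = pearson(height, weight)

# Under the "no relationship" assumption, any pairing of heights with
# weights is equally likely, so shuffle one variable and count how often
# a correlation at least this strong shows up by chance.
random.seed(1)
trials = 10_000
count = 0
shuffled = weight[:]
for _ in range(trials):
    random.shuffle(shuffled)
    if abs(pearson(height, shuffled)) >= abs(r_obs):
        count += 1

p_value = count / trials
print(f"r = {r_obs:.3f}, p ≈ {p_value:.4f}")
```

The observed correlation here is strong, so essentially no shuffled pairing matches it and the estimated p-value is near zero.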

If we change the variables from height and weight to intelligence and parents’ income, the math doesn’t change. We still make the assumption that intelligence and parental income are unrelated and then see how correlated they are. If they were independent of each other, the correlation would probably be close to 0, and a correlation far from 0 would be very unlikely. The likelihood that independent random numbers end up correlated doesn’t depend on whether the variables are biological, sociological, physical, etc.

Scientists are usually careful not to use the word “prove.” Statistical significance levels are somewhat arbitrary, but a p-value of 0.05 means that, if there were no real effect, results at least this extreme would occur by chance only 5% of the time. There is always the chance of Type I and Type II error (false positives/negatives), which is why replication is important.

>For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

I don’t think anyone is answering this, but it’s because you’re assuming that the averages of the random variables follow a normal distribution. This comes from the Central Limit Theorem, which says that as your sample gets arbitrarily large, the probability distribution of the sample average approaches a normal distribution, even if the variables themselves aren’t normally distributed.

So, for example, flipping a coin is 50% heads or 50% tails. A single coin flip isn’t normally distributed at all; it follows a Bernoulli distribution (and the count of heads over several flips follows a binomial distribution). However, if you flip 10,000 coins and count how many heads you get, you will find that the count follows a bell curve: 50% of the time you’ll get 5,000 heads or fewer, about 68% of the time you’ll get a value within one standard deviation of 5,000, etc. And this works with ANYTHING from ANY probability distribution as long as n gets large enough. So for real-life problems we don’t actually need to know the probability distribution of single events, as long as we take a large enough sample size! (Well, in general; I know there are normality tests you can do, but I’m not getting into that.)
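A quick simulation illustrates this; the exact counts vary with the random seed, but the head totals cluster in a bell shape around 5,000 with standard deviation sqrt(10000 × 0.5 × 0.5) = 50:

```python
import random

random.seed(0)

# Flip 10,000 fair coins and count the heads; repeat 500 times to see how
# the totals spread around the expected 5,000.
def count_heads(n_flips):
    return sum(random.getrandbits(1) for _ in range(n_flips))

totals = [count_heads(10_000) for _ in range(500)]

mean = sum(totals) / len(totals)
sd = (sum((t - mean) ** 2 for t in totals) / len(totals)) ** 0.5

# Theory predicts sd = 50 and ~68% of runs within one sd of 5,000.
within_one_sd = sum(abs(t - 5_000) <= 50 for t in totals) / len(totals)
print(f"mean ≈ {mean:.0f}, sd ≈ {sd:.1f}, within 1 sd: {within_one_sd:.0%}")
```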

So what you’re doing with a p-test is taking a sample, assuming its average is (approximately) normally distributed, and comparing it to the theoretical normal distribution implied by the null hypothesis, with its given mean and standard deviation. The p-value is the probability of seeing a sample average at least this far from the hypothesized mean if that null distribution really were the true one.

How I learned to understand significance is that P-value = probability value.

It’s the probability that the effect your stats test is measuring arose from random variation instead of a real cause.

In biology, variation is the name of the game, so it’s important to know the odds that what you’re seeing is just variation.

We accept under 0.05 as significant because a 5% chance of a fluke versus 95% not was considered acceptable. But no matter how small the p-value is, there is always some chance the result is random and not real.

>Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

> I always thought there must be more nuance as the actual variables can be so many different things.

I think this is the problem that is missing from many answers.

You always have to MODEL the variables somehow. That is, make some mathematical assumptions about how those variables behave. This is what allows you to analyze the relationship between them. If your model is wrong, the work is useless.

If you’re an undergraduate, they probably just skip this part, because covering it would require teaching you real math and statistics. Instead, they give you some formulas based on a model they already had in mind. And it’s not just undergraduates; even actual researchers can be amazingly bad at statistics. It’s hard to tell how many researchers are bad at statistics and how many are downright fraudulent, but it’s a problem in science.

So it’s important to make a model generic enough that it’s unlikely to be wrong, but specific enough that you can analyze it. On one end, you have Fisher-style tests, which are mathematically simple, with maths that has been understood since the time of Gauss, but which are very simplistic and require you to make a lot of assumptions. On the other end, you have all these newfangled deep learning networks, which nobody quite knows how they work, but which are a lot more generic.

Once you have a model, you can mathematically analyze it to see what kind of data you would expect. If you haven’t done this, it’s not possible to quantitatively say whether something is significant. In fact, it is quite possible to get amazingly useless results because the analysis was done poorly. This is a huge problem in research, especially in fields like nutritional science, psychology, and sociology.

As others have noted, “significance” in the sense of p-values just refers to how often you would observe some data pattern from random chance alone, given the total possibilities of the system you are observing. It doesn’t prove anything in and of itself; all a p-value can do is say that there is a pattern that seems to *contradict* the null hypothesis (i.e. that your data arose from random chance alone).

It is important to set this up and use it correctly: if you are testing lots of things at once, you need to account for that properly in the stats test you use, e.g. applying an F-test rather than multiple t-tests, or using a Bonferroni correction, where you divide your significance threshold by the number m of hypotheses you are testing (equivalently, multiply each observed p-value by m). Otherwise you are just *cherry-picking*, i.e. throwing stuff at the wall to see what sticks, without really explaining or learning anything about stickiness.
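A minimal sketch of the Bonferroni arithmetic, with hypothetical p-values:

```python
# Bonferroni correction: to keep the overall false-positive rate near
# alpha while running m tests, require each individual p-value to clear
# alpha / m instead of alpha.
alpha, m = 0.05, 20
threshold = alpha / m  # = 0.0025

p_values = [0.001, 0.004, 0.012, 0.030, 0.200]  # hypothetical results
significant = [p for p in p_values if p < threshold]
print(threshold, significant)  # only 0.001 survives the correction
```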

Separately…

It’s worth noting that the mechanics of specifying a null hypothesis, “significance”, and the meaning of p-values under a “frequentist” paradigm are not intuitive to most humans at all.

The math has historically been trickier to calculate, but with modern computing, the “Bayesian stats” paradigm is easier to understand. It is simply about the level of confidence I have that something is true or not, and I can use that paradigm to synthesize evidence from lots of different previous study designs and setups, as long as I have accurate figures and confidence in the random sampling of each.

In real life (and science) we use prior knowledge and theory all the time.

If I am walking along a dark street at night and I see a jewelry store that has a broken window and merchandise strewn about, I can be confident enough that a robbery has taken place to call the police. I can triangulate from other knowledge without needing to have randomly seen the exact same scene many times before.

The root problem is that “significant” has a common-language definition and a stats definition that aren’t the same. In your post, you’re using it like the common usage meaning “substantial” or “important”.

In stats-language, “significant” just means “less than ___% chance this result is a random coincidence”, where ___% is *whatever P value threshold you choose* to use.

If you decide to use p = 0.05 as your cutoff, you’re saying you’ll accept up to a 5% chance that the result is a random coincidence, i.e. that you’d see data this extreme even if nothing real were going on. So if you get p < 0.05, it means “less than a 5% chance this is a random coincidence”.

But you could just as easily choose to use P = 0.4 as your cutoff. Then let’s say you do the calculation and find some effect has P = 0.3. That P is smaller than your chosen “threshold of significance”, so by definition that effect is “significant” (by the stats definition) *even though you just showed it has a 30% chance of being a random coincidence*.


>How do these tests actually know the data is significant

They don’t – that’s a misuse of the common meaning of “significant”. **Calculating p only tells you the probability of seeing an effect this large by random coincidence (if nothing real were going on), and you decide (by your choice of critical p, often 0.05 by convention) at what probability of false alarm you’re willing to call the result “significant”.**