How do statistical tests prove significance?


I did a biology undergraduate degree and often did reports where we would statistically analyse our results. A P value of less than 0.05 shows that the results are statistically significant. How do these tests actually know the data is significant? For example, we might look at correlation and get a significant positive correlation between two variables. Given that the variables can be literally anything in question, how does doing a few statistical calculations determine it is significant? I always thought there must be more nuance, as the actual variables can be so many different things. It might show me a significant relationship for two sociological variables and also for two mathematical ones, when those variables are so different?

In: Mathematics

17 Answers

Anonymous 0 Comments

The root problem is that “significant” has a common-language definition and a stats definition that aren’t the same. In your post, you’re using it in the common sense, meaning “substantial” or “important”.

In stats-language, “significant” just means “less than a ___% chance that random coincidence alone would produce a result like this”, where ___% is *whatever P value threshold you choose* to use.

If you decide to use P of 0.05 as your cutoff, you’re saying you’ll only call a result “significant” if random coincidence alone would produce a result like it less than 5% of the time. So if you get P < 0.05, it means “a result like this would arise from random coincidence alone less than 5% of the time”.

But you could just as easily choose to use P = 0.4 as your cutoff. Then let’s say you do the calculation and find some effect has P = 0.3. That P is smaller than your chosen “threshold of significance”, so by definition that effect is “significant” (by the stats definition) *even though you just showed that random coincidence alone would produce a result like it 30% of the time*.

>How do these tests actually know the data is significant

They don’t – that’s a misuse of the common meaning of “significant”. **Calculating P only tells you how often random coincidence alone would produce an effect at least as big as the one you observed, and you decide (by your choice of critical P, often 0.05 by convention) at what probability of false alarm you’re willing to call the result “significant”.**
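To make that concrete, here is a minimal sketch in Python of how the cutoff decision plays out; the group names, the sample numbers, and the 0.05 cutoff are all invented for illustration, and any two-sample test would do.

```python
# Toy two-group comparison: compute a p-value, then apply a cutoff you chose.
from scipy import stats

control   = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]   # made-up measurements
treatment = [4.4, 4.6, 4.2, 4.7, 4.5, 4.3, 4.6, 4.4]   # made-up measurements

alpha = 0.05  # the threshold *you* choose; nothing magical about 0.05

result = stats.ttest_ind(treatment, control)
print(f"p-value = {result.pvalue:.4f}")

# "Significant" just means the p-value fell below the cutoff you picked.
if result.pvalue < alpha:
    print("statistically significant at the chosen cutoff")
else:
    print("not statistically significant at the chosen cutoff")
```

The same data would be labelled “significant” or “not significant” purely depending on the alpha chosen beforehand, which is the point being made above.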

Anonymous 0 Comments

As others have noted, “significance” in the sense of p-values just refers to how often you would observe a data pattern like yours from random chance alone, given the range of possible outcomes of the system you are observing. It doesn’t prove anything in and of itself – all a p-value can do is say that there is a pattern that seems to *contradict* the null hypothesis (i.e. the hypothesis that your data arose from random chance alone).

It is important to set this up and use it correctly: if you are testing lots of things, you need to account for that properly in the stats test you use, e.g. applying an F-test rather than multiple t-tests, or using a Bonferroni correction, where you divide your significance threshold by the number m of hypotheses you are testing (equivalently, multiply each observed p-value by m). Otherwise you are just *cherry-picking*, i.e. throwing stuff at the wall to see what sticks, without really explaining or learning anything about stickiness.
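As a rough sketch of what that correction looks like in practice (the simulated data, the 20 hypotheses, and the 0.05 threshold below are all invented for illustration, not something from the comment above):

```python
# Bonferroni correction: with m hypotheses, compare each p-value
# against alpha / m instead of alpha (equivalently, multiply p by m).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
m = 20  # number of hypotheses tested

# Twenty comparisons where the null is actually true (no real differences).
p_values = []
for _ in range(m):
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    p_values.append(stats.ttest_ind(a, b).pvalue)

naive_hits     = sum(p < alpha for p in p_values)       # cherry-picking risk
corrected_hits = sum(p < alpha / m for p in p_values)   # Bonferroni-adjusted

print(f"'significant' without correction: {naive_hits}")
print(f"'significant' with Bonferroni:    {corrected_hits}")
```

With 20 tests of a true null at alpha = 0.05, roughly one spurious “significant” result is expected without correction; the Bonferroni-adjusted threshold is there to guard against exactly that kind of cherry-picking.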

Separately…

It’s worth noting that the mechanics of specifying a null hypothesis, “significance”, and the meaning of p-values under a “frequentist” paradigm are not intuitive to most humans at all.

The math has historically been trickier to calculate, but with modern computing, “Bayesian stats” paradigms are easier to understand. It is simply about the level of confidence I have that something is true or not, and I can use that paradigm to synthesize lots of evidence from different previous study designs and setups, as long as I have accurate figures and confidence in random sampling from each.
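As a toy illustration of that “level of confidence” idea, here is a minimal Bayesian update for a coin’s heads-probability; the coin example, the flat prior, and the 14-out-of-20 data are my own invented assumptions, not part of the comment above.

```python
# Bayesian updating for a coin's heads-probability using a Beta prior.
# Beta(a, b) prior + k heads in n flips  ->  Beta(a + k, b + n - k) posterior.
from scipy import stats

a_prior, b_prior = 1, 1        # flat prior: no opinion about the coin yet
heads, flips = 14, 20          # made-up data

posterior = stats.beta(a_prior + heads, b_prior + flips - heads)

# Confidence that the coin favours heads: P(theta > 0.5 | data).
print(f"P(coin favours heads | data) = {1 - posterior.cdf(0.5):.3f}")

# A 95% credible interval for the heads-probability itself.
low, high = posterior.interval(0.95)
print(f"95% credible interval: ({low:.2f}, {high:.2f})")
```

The output is a direct statement of confidence about the coin, which is what makes the Bayesian framing feel more intuitive to many people than a p-value.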

In real life (and science) we use prior knowledge and theory all the time.

If I am walking along a dark street at night and I see a jewelry store that has a broken window and merchandise strewn about, I can be confident enough that a robbery has taken place to call the police. I can triangulate from other knowledge without needing to have randomly seen the exact same scene many times before.

Anonymous 0 Comments

>Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

> I always thought there must be more nuance as the actual variables can be so many different things.

I think this is the point that is missing from many answers.

You always have to MODEL the variables somehow. That is, make some mathematical assumptions about how these variables work. This allows you to analyze the relationship between the variables. If your model is wrong, the work is useless.

If you’re an undergraduate, they probably just skip this part because it would require teaching you more math and statistics. Instead, they just give you some formulas based on a model they already had in mind. It’s not just undergraduates, either: even actual researchers can be amazingly bad at statistics. It’s hard to tell how many researchers are bad at statistics and how many are committing outright fraud, but it’s a problem in science.

So it’s important to make a model generic enough that it’s unlikely to be wrong, but specific enough that you can analyze it. On one end, you have Fisher’s type of tests, which are mathematically simple and whose math has been understood since the time of Gauss, but which are very simplistic and require you to make a lot of assumptions. On the other end, you have all these newfangled deep learning networks, which nobody fully understands, but which are a lot more generic.

Once you have a model, you can mathematically analyze it to see what kind of data you would expect to get. If you haven’t done this, it’s not possible to say quantitatively whether something is significant. In fact, it is quite possible to get amazingly useless results because the analysis is done poorly. This is a huge problem in research, especially in fields like nutritional science, psychology, and sociology.

Anonymous 0 Comments

How I learned to understand significance is that P-value = probability value.

It’s the probability that the effect your stats test is measuring would show up from random variation alone, rather than from a real cause.

In biology, variation is the name of the game, so it’s important to know the odds that what you’re seeing is just due to variation.

We accept under 0.05 as significant because a 5% chance of the result being random noise was considered an acceptable risk. But no matter how small the p-value is, there is always a chance that the result is random rather than a real effect.

Anonymous 0 Comments

>For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

I don’t think anyone is answering this, but it’s because you’re assuming that the averages of both random variables follow a normal distribution. This is because of the Central Limit Theorem, which says that as your sample gets arbitrarily large, the probability distribution of the average of the observations will approach a normal distribution, even if the observations themselves aren’t normally distributed.

So, for example, a coin flip is 50% heads or 50% tails. A single coin flip isn’t normally distributed at all (it follows a Bernoulli distribution, and the count of heads over many flips follows a binomial distribution). However, if you flip 10,000 coins and count how many heads you get, you will find that the count follows a bell curve: roughly 50% of the time you’ll get 5,000 heads or fewer, about 68% of the time you’ll get a value within 1 standard deviation of 5,000, etc. And this works with ANYTHING from ANY probability distribution as long as n gets large enough. So for real-life problems we don’t actually need to know the probability distribution of single events as long as we take a large enough sample! (Well, in general – I know there are normality tests you can run, but I’m not getting into that.)
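A quick simulation checks that bell-curve claim numerically; the number of repeats and the random seed are arbitrary choices for illustration.

```python
# Central Limit Theorem by simulation: the count of heads in 10,000 fair
# coin flips, repeated many times, piles up into a bell curve around 5,000.
import numpy as np

rng = np.random.default_rng(42)
n_flips, n_repeats = 10_000, 100_000

head_counts = rng.binomial(n=n_flips, p=0.5, size=n_repeats)

mean = head_counts.mean()   # should be close to 5,000
std = head_counts.std()     # should be close to sqrt(n*p*(1-p)) = 50

within_1sd = np.mean(np.abs(head_counts - mean) <= std)
print(f"mean = {mean:.1f}, std = {std:.1f}")
print(f"fraction within 1 standard deviation: {within_1sd:.3f}  (normal predicts ~0.683)")
```

Running this prints a fraction close to 0.68, matching the normal approximation described above.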

So what you’re doing with such a test is taking a sample, relying on its mean being approximately normal, and comparing it to the normal distribution you would expect under your hypothesis (a given mean and standard deviation). The p-value is the probability of seeing a sample mean at least this far from the hypothesized mean *if* your data really did follow that hypothetical distribution; a small p-value is evidence that your data instead follows some other (normal) distribution with a different mean.
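A minimal sketch of that comparison, assuming a one-sample t-test; the hypothesized mean and the sample data are made up for illustration.

```python
# One-sample t-test: is the sample mean consistent with a hypothesized mean?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.4, scale=2.0, size=50)   # made-up measurements

hypothesized_mean = 10.0
result = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Small p-value => a sample mean this far from 10.0 would be unlikely
# if the data really came from a distribution centred at 10.0.
print(f"sample mean = {sample.mean():.2f}, p-value = {result.pvalue:.4f}")
```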

Anonymous 0 Comments

Scientists are usually cautious to specifically not use the word “prove.” Statistical significance levels are generally somewhat arbitrary, but a P value of 0.05 indicates that, if there were no real effect of the experimental manipulation, results at least this extreme would occur only 5% of the time by chance alone. There is always the chance of Type I and Type II error (false positives/negatives), which is why replication is important.

Anonymous 0 Comments

Generally, a statistical test can show that the probability that something happens under a certain assumption is very low (often, lower than 0.05 or 5% is used as a threshold). This gives evidence that the assumption is wrong. For example, I can flip a coin and make the assumption that the coin is fair (heads and tails are equally likely). If I flip it 20 times and it comes up heads 19 times, I can reason “if this coin were fair, then the chance of getting 19 or more heads out of 20 is extremely small, so this coin is probably not fair”.
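That coin argument can be written as a quick calculation; the 19-out-of-20 numbers come from the example above, while the choice of a two-sided binomial test (and a reasonably recent SciPy, which provides `binomtest`) is my own assumption.

```python
# How surprising is 19 heads in 20 flips if the coin is actually fair?
from scipy import stats

result = stats.binomtest(k=19, n=20, p=0.5)
print(f"p-value = {result.pvalue:.6f}")
# A tiny p-value: a fair coin almost never produces a result this lopsided,
# which is the evidence against the "coin is fair" assumption.
```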

We can do the same thing for correlation. If we have a bunch of observations and each observation contains two variables (say, we measure the height and weight of a bunch of people), we can make the “assumption” that height and weight are not related to each other, aka, independent. Under this assumption, it is highly likely that the correlation coefficient will be quite close to 0. If we compute the correlation and find that it’s something much larger than 0, we know it is very unlikely for a bunch of unrelated random numbers to have a big correlation (by pure chance) so we conclude that our assumption was probably wrong. The statistical test gives us a way of quantifying exactly how unlikely this was.
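A rough sketch of that quantification; the height/weight numbers are invented, and `pearsonr` is just one common way to get a correlation together with a p-value.

```python
# Test whether an observed correlation is plausibly just chance.
from scipy import stats

heights = [160, 172, 168, 181, 175, 158, 190, 165, 178, 170]   # cm, made up
weights = [ 55,  70,  65,  85,  78,  52,  95,  60,  80,  68]   # kg, made up

r, p_value = stats.pearsonr(heights, weights)
print(f"correlation = {r:.2f}, p-value = {p_value:.4f}")
# The p-value answers: "if height and weight were truly independent,
# how often would 10 random pairs show a correlation at least this strong?"
```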

If we change the variables from height and weight to intelligence and parents’ income, the math doesn’t change. We still make the assumption that intelligence and parental income are unrelated and then see how correlated they actually are. If they were independent of each other, the correlation would probably be close to 0, and a correlation far from 0 would be very unlikely. The likelihood that independent random numbers end up correlated doesn’t depend on whether the variables are biological, sociological, physical, etc.

Anonymous 0 Comments

This book [https://www.amazon.com/How-Not-Be-Wrong-Mathematical/dp/0143127535](https://www.amazon.com/How-Not-Be-Wrong-Mathematical/dp/0143127535) has a very good explanation of the history and interpretations of “significance.”

Anonymous 0 Comments

What you’re getting at is that “significance” doesn’t really mean “significance.”

A better term for “statistical significance” is “statistical discernibility.” You measure some X and Y 100 times and find a correlation of 0.42 with some standard error. Then you ask, “Okay, if the true correlation between X and Y were zero, how hard would it be to draw a sample of 100 with a sample correlation of 0.42 or more extreme?” The answer to that is your p-value. If the p-value is low, you’re saying “We can’t ever know the exact true correlation, but we can be very confident that it isn’t zero.” You’re saying your result can be discerned or distinguished from zero.
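One way to make that question concrete is a small permutation check; the sample size, the built-in relationship, and the shuffling approach below are all illustrative choices rather than anything from the comment above.

```python
# Permutation check: if X and Y were unrelated, how often would shuffled
# data produce a correlation as extreme as the one we observed?
import numpy as np

rng = np.random.default_rng(7)
n = 100

# Made-up sample with a genuine relationship built in.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
observed_r = np.corrcoef(x, y)[0, 1]

shuffled_rs = []
for _ in range(10_000):
    y_shuffled = rng.permutation(y)              # breaks any real X-Y link
    shuffled_rs.append(np.corrcoef(x, y_shuffled)[0, 1])

p_value = np.mean(np.abs(shuffled_rs) >= abs(observed_r))
print(f"observed r = {observed_r:.2f}, permutation p-value = {p_value:.4f}")
```

A tiny p-value here says exactly what the paragraph describes: the observed correlation is hard to reproduce when the true relationship is forced to be zero, so it can be discerned from zero.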

But, statistical significance doesn’t mean that it substantively matters. That’s a matter for effect size and the confidence interval around it. Suppose you’re researching the effects of eating blueberries on human longevity, and find a statistically discernible effect. If that effect is “You would have to eat the entire mass of the earth in blueberries every year to extend your life by one month,” it doesn’t really matter even if the p-value is 0.0000000001.

Statistical significance also doesn’t mean causality; the usual examples here are Tyler Vigen’s spurious correlations. X and Y can go together because:

* X causes Y
* Y causes X
* They both cause each other simultaneously
* There’s some Z that causes X and Y
* Other stuff I’m forgetting
* For literally no reason at all

Figuring out causality is, mostly, a research design question and not a statistics question. There are circumstances where causality is relatively straightforward statistically, but you have to be able to perform true experiments or have to luck into the right kind of data being available.

When you can’t do a true experiment and you don’t have that lucky kind of data, what you mostly do is ask “What would the world look like if I were right? What would it look like if I were wrong?” If you’re right, more X goes with more Y, and more A goes with less B, and so on. What you’d like to do here is have a whole set of things to look at, some of which are weird or surprising. You can see a lot of this in early covid work and other epidemiological studies – if it’s spread by air, then we should see these relationships between variables, but if it’s spread by droplets, we should see other relationships that we wouldn’t see if it were airborne, and if it’s spread by contaminated water we should see yet other relationships.

Anonymous 0 Comments

Short eli5 answer: P value is “Assuming our hypothesis is wrong (i.e. the null hypothesis is true), what are the odds of getting a result like this by chance?”

It’s not proving or disproving anything, including a relationship between two variables. All it’s doing is saying that, assuming our hypothesis is wrong (aka the null hypothesis/status quo is ‘true’), you would see a result at least as extreme as the one we got about (P value × 100) percent of the time.