# How do statistical tests prove significance?


I did a biology undergraduate degree and often did reports where we would statistically analyse our results. A p-value of less than 0.05 shows that the results are statistically significant. How do these tests actually know the data is significant? For example, we might look at correlation and get a significant positive correlation between two variables. Given that the variables can be literally anything, how does doing a few statistical calculations determine it is significant? I always thought there must be more nuance, as the actual variables can be so many different things. It might show me a significant relationship for two sociological variables and also for two mathematical ones, when those variables are so different?

In: Mathematics

The basic idea is that when doing an experiment, you get a certain value with a certain probability, which can be described by a probability distribution. Most often, it’s assumed that your values follow the normal distribution, aka the bell curve.

Then you fit such a curve to each of your observation sequences (e.g. patients who received a drug, and those who received a placebo), and look at how much the two bell curves overlap. The less the overlap, the lower the probability that the difference is just random chance.
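The same “could this difference be chance?” question can be asked directly with a permutation test (a different technique from fitting bell curves, but answering the same question): pool both groups and see how often a random relabelling produces a gap as large as the real one. A minimal sketch, using entirely made-up scores for two hypothetical groups:

```python
import random

# Hypothetical outcome scores for a drug group and a placebo group.
drug    = [7.1, 6.8, 7.4, 7.9, 6.5, 7.2, 7.7, 6.9]
placebo = [6.2, 6.6, 5.9, 6.4, 6.8, 6.1, 6.3, 6.0]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(drug) - mean(placebo)  # = 0.9

# If the drug did nothing, the group labels are arbitrary, so shuffle the
# pooled data many times and count how often a difference at least this
# large appears by chance alone.
random.seed(0)
pooled = drug + placebo
n_drug = len(drug)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[:n_drug]) - mean(pooled[n_drug:]) >= observed_diff:
        count += 1

p_value = count / trials
print(f"observed difference: {observed_diff:.2f}, p ≈ {p_value:.4f}")
```

Because the two groups barely overlap, almost no random relabelling matches the real gap, so the estimated p-value comes out very small.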

“Significance” in this context doesn’t mean “this is true”; it means “the chance this is true is pretty damn high”. Generally speaking, the stronger the correlation, the more likely it is to be real. The p-value is essentially a measure of how likely it is that the results you got were just a fluke – that there’s no pattern at all and the data just happened to come out looking like there was. The tests that determine the p-value look at how far the data deviate from what you’d expect if nothing were going on, relative to the noise. The larger that deviation is relative to the noise, the lower the p-value, because it’s very unusual for random chance alone to produce such a large, consistent deviation. It could still happen, which is why the p-value isn’t 0; all you’re doing is saying “the chance that random chance produced *these* results is sufficiently low that we can decide the correlation is significant and therefore reproducible”.

Also, there are cases where a p-value of 0.05 is still too high to be confident the correlation is actually there. In some fields, the results won’t be considered significant until the p-value is below 0.01, or even lower.

When we find a relation with a small p-value, we are essentially saying there is only a small chance we would see such a relation by random chance alone. This then allows us to accept a hypothesis at a certain level of confidence.

When using p-values and other statistical methods, you are looking to either accept or reject a hypothesis that you have created. Frequently, we use null and alternative hypotheses for simplicity. When creating a testable hypothesis, it needs to pass the sniff test. Historically, butter production in various countries has had a statistically significant relation to the returns of the S&P 500. This doesn’t mean that the relation is necessarily real, though.

With the ridiculous number of possible relations in our data-rich world, there will be significant relations between variables that make no sense. The probability of getting 10 heads in a row is incredibly small (about 1 in 1,024), but in a set of 100,000 flips it’s actually fairly likely to happen somewhere. The way to get around this is either using common sense in data and relation selection, or finding the same significant relation in comparable, independent data.

The correlation between worldwide non-commercial space launches and sociology doctorates per year is high, but does it really mean anything? Maybe space launches correlate with science funding, and total doctorates also increase with global science funding? Maybe the US has the majority of space launches, and so it makes more sense there. A high correlation does not imply truth.

https://blog.psyquation.com/es/correlation-with-a-twist/

A statistical test cannot “prove” significance. In fact, you cannot prove statistical significance at all; you can only measure it. There are many techniques to measure it, and they usually give you a couple of numbers. Usually the most important number is the p-value (others include effect sizes).

The p in p-value stands for probability. It measures how likely you would be to get results this good by pure luck, if nothing were actually happening.

For example, let’s say I claim to be a psychic who can control chaos magic like Wanda and determine the result of a coin toss. How many heads in a row is enough to convince you that I am a psychic?

If I throw 2 heads in a row, you might just call me lucky. If I throw 5 heads in a row, you might think that I’m up to something. If I throw 20 heads in a row, I will definitely get your interest. Either I have an excellent throwing technique, or there’s a trick in the coin, or I’m a real psychic, but you would be pretty sure it’s not down to chance.

So maybe for you the limit is somewhere between 5 and 20 coin tosses. If you do the statistical test, the p-value for 5 heads in a row is 0.03125, while for 20 heads it is about 9.5e-7.
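The arithmetic behind those two numbers is just repeated halving, a quick check:

```python
# Chance of k heads in a row from a fair coin is 0.5 ** k.
p5 = 0.5 ** 5    # 0.03125
p20 = 0.5 ** 20  # = 1/1048576 ≈ 9.5e-7
print(p5, p20)
```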

Now, the same with biology. Let’s say we are testing whether a medicine is working. How do we know if a medicine is working, or whether it’s just luck?

*************************************

Well, you want to find out the p-value. To do that, you use one of the many statistical tests. These are tools that people can misuse and abuse; in fact, it is quite hard to get right.

And then you get a p-value. Different fields have different standards. It seems that you are familiar with p < 0.05, which is a 1 in 20 chance that it’s luck. Other fields require 5 sigmas, which translates to roughly a 1 in 3.5 million chance.

https://news.mit.edu/2012/explained-sigma-0209

Basically you come up with the “null hypothesis” (the assumption that nothing special is going on) and ask “how likely is this result to happen by chance alone?”

So let’s say I claim I can toss a coin to land however I want, most of the time. How do you test this?

Say I toss 4 heads and 1 tails while trying to make it always heads. The default is that it’s 50:50 and I have no effect. So how likely is it to toss at least 4 heads in 5 tosses? 0.1875

So we’d say that wasn’t significant at the 5% level. Since it would happen 19% of the time by chance!

Now if I tossed 11 heads and 1 tails, the probability would be about 0.00317, which is roughly 0.3% and so below the 5% threshold commonly used for significance. (Although that’s quite arbitrary; you can choose any threshold before you start.)
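Both tail probabilities can be checked with the binomial formula: count the equally likely toss sequences with at least k heads and divide by the 2^n total sequences. A quick sketch:

```python
from math import comb

def p_at_least(k_heads, n_tosses):
    """Chance of at least k heads in n tosses of a fair coin."""
    favourable = sum(comb(n_tosses, i) for i in range(k_heads, n_tosses + 1))
    return favourable / 2 ** n_tosses

print(p_at_least(4, 5))    # 0.1875
print(p_at_least(11, 12))  # = 13/4096 ≈ 0.00317
```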

What this calculation doesn’t do is tell you how effectively I can control it being a head. Just that I can deviate from the normal result enough that I can produce otherwise unlikely events.

> For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

You’re doing it wrong.

You start with a null hypothesis. This is the thing you want to show is false.

For example, you think that people who eat more apples also eat more pears. So your null hypothesis is that people eat the same number of pears regardless of how many apples they eat (no correlation). Then you go get data and test it.

But people also eat plums, and that might affect whether they eat pears! So you include plum eating in the formula.

If you find a correlation between eating pears and eating plums with a P value of less than 0.05, is that statistically significant?

No, it is not. Why? Because that is only your hypothesis BECAUSE you found a correlation, which biases your results. You might have had 20 different fruits you were checking, which would mean the odds are that at least one would show a correlation, even if there were no real correlations at all.
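The rough arithmetic behind that: if each of 20 independent comparisons has a 5% false-positive rate, the chance that at least one of them looks “significant” by pure luck is substantial:

```python
# Each comparison alone has a 5% chance of a false positive; across 20
# independent comparisons, the chance of at least one spurious "hit" is:
p_any_false_positive = 1 - 0.95 ** 20
print(f"{p_any_false_positive:.2f}")  # ≈ 0.64
```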

Doing random p-testing on things you have no reason to think are correlated is simply wrong. You might well do that at a preliminary stage to find out what hypothesis to test in the first place, but the data you use to come up with the hypothesis cannot be the same data you use to test it.

To see why, imagine a man who sees a coin tossed 4 times. Each time it comes up heads. He thinks: maybe the coin is biased. He then tests that using his observations of the coin coming up heads 4 times, and, lo and behold, the data backs it up – this coin is not a fair coin! That’s crazy, right?

> Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

Basically, P values tell you, “what are the odds we’d get this if our null hypothesis was true?”. If it’s 0.05, that suggests but does not prove that your null hypothesis is false. Go do more testing.

>For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

I think you’ve been thrown by one of the (many) confusing things about statistical significance. It took a while for this to click with me.

A test of significance has *nothing* to do with what you’re actually measuring. Figuring out the relationship between two variables is about picking the right test of correlation (and ensuring the logic of such a relationship).

Significance testing is about *sampling*. (Which in this case could also be repeating an experiment.)

Imagine I have two bags filled with loads of red and blue balls. I want to know if the proportions are different between the two bags, so I pull out a random sample from each bag.

Now imagine I have two big groups of men and women. I want to know if the proportions are the same between the two groups, so I draw a random sample from each.

I’m looking at completely different things, but the statistics of *sampling* are the same.

Now, actually doing significance testing right is rather more complex than that. To start with, you need to decide what level of significance you need. Mathematics can’t tell you that; it’s a question of “what level of risk am I willing to take that my results are down to chance?” A 95% confidence level is just a convention, and as another commenter has said, the convention is not always 95%.

Also, significance tests assume the null hypothesis is true. This leads to a problem like screening tests: if most people don’t have a disease, then you’ll get a lot more false positives than if most people do have it. In a similar way, if a hypothesis is unlikely to be true, you’re more likely to get a false positive than if it’s likely to be true. But factoring that in means making a guess at how likely your hypothesis is to be true, which takes us into the contentious world of Bayesian statistics.

“The problem is that there is near unanimity among statisticians that p values don’t tell you what you need to know but statisticians themselves haven’t been able to agree on a better way of doing things.”

That last bit is probably a bit complex for ELI5… I might be able to explain it better if anyone wants. (Or [here](http://www.dcscience.net/2020/10/18/why-p-values-cant-tell-you-what-you-need-to-know-and-what-to-do-about-it/) is a more technical explanation for anyone who wants that.)

Short ELI5 answer: the p-value is “assuming our hypothesis is wrong, what are the odds that we got a result at least this extreme by chance?”

It’s not proving or disproving anything, including a relationship between two variables. All it’s doing is saying that, assuming our hypothesis is wrong (aka the null hypothesis/status quo is ‘true’), you would see a result at least as extreme as ours (p-value × 100) percent of the time.

What you’re getting at is that “significance” doesn’t really mean “significance.”

A better term for “statistical significance” is “statistical discernibility.” You measure some X and Y 100 times and find a correlation of 0.42 with some standard error. Then you ask, “Okay, if the true correlation between X and Y were zero, how hard would it be to draw a sample of 100 with a sample correlation of 0.42 or more extreme?” The answer to that is your p-value. If the p-value is low, you’re saying “We can’t ever know the exact true correlation, but we can be very confident that it isn’t zero.” You’re saying your result can be discerned or distinguished from zero.

But, statistical significance doesn’t mean that it substantively matters. That’s a matter for effect size and the confidence interval around it. Suppose you’re researching the effects of eating blueberries on human longevity, and find a statistically discernible effect. If that effect is “You would have to eat the entire mass of the earth in blueberries every year to extend your life by one month,” it doesn’t really matter even if the p-value is 0.0000000001.

Statistical significance also doesn’t mean causality; the usual examples here are Tyler Vigen’s spurious correlations. X and Y can go together because:

* X causes Y
* Y causes X
* They both cause each other simultaneously
* There’s some Z that causes X and Y
* Other stuff I’m forgetting
* For literally no reason at all

Figuring out causality is, mostly, a research design question and not a statistics question. There are circumstances where causality is relatively straightforward statistically, but you have to be able to perform true experiments or have to luck into the right kind of data being available.

When you can’t do a true experiment and you don’t have that lucky kind of data, what you mostly do is ask “What would the world look like if I were right? What would it look like if I were wrong?” If you’re right, more X goes with more Y, and more A goes with less B, and so on. What you’d like to do here is have a whole set of things to look at, some of which are weird or surprising. You can see a lot of this in early covid and other epidemiological studies — if it’s spread by air, then we should see these relationships between variables, but if it’s spread by droplets, we should see other relationships that we wouldn’t see if it were airborne, and if it’s spread by contaminated water we should see yet other relationships.

This book [https://www.amazon.com/How-Not-Be-Wrong-Mathematical/dp/0143127535](https://www.amazon.com/How-Not-Be-Wrong-Mathematical/dp/0143127535) has a very good explanation of the history and interpretations of “significance.”

Generally, a statistical test can show that the probability that something happens under a certain assumption is very low (often, lower than 0.05 or 5% is used as a threshold). This gives evidence that the assumption is wrong. For example, I can flip a coin and make the assumption that the coin is fair (heads and tails equally likely). If I flip it and it comes up heads 19 times out of 20, I can reason “if this coin were fair, then the chance of getting at least 19 heads out of 20 would be extremely small, so this coin is probably not fair”.
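For the record, the coin example works out to a tiny probability under the fairness assumption:

```python
from math import comb

# Chance of at least 19 heads in 20 tosses of a fair coin.
p = sum(comb(20, k) for k in (19, 20)) / 2 ** 20
print(p)  # = 21/1048576 ≈ 0.00002, far below any common threshold
```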

We can do the same thing for correlation. If we have a bunch of observations and each observation contains two variables (say, we measure the height and weight of a bunch of people), we can make the “assumption” that height and weight are not related to each other, aka, independent. Under this assumption, it is highly likely that the correlation coefficient will be quite close to 0. If we compute the correlation and find that it’s something much larger than 0, we know it is very unlikely for a bunch of unrelated random numbers to have a big correlation (by pure chance) so we conclude that our assumption was probably wrong. The statistical test gives us a way of quantifying exactly how unlikely this was.
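One way to quantify “how unlikely” without any distributional formulas is a permutation test: shuffle one variable so any real link is destroyed, and see how often a correlation as strong as the observed one appears anyway. A sketch with made-up height/weight numbers for ten hypothetical people:

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical heights (cm) and weights (kg).
height = [160, 165, 170, 172, 175, 178, 180, 183, 186, 190]
weight = [55, 62, 66, 64, 70, 74, 72, 80, 84, 88]

r_obs = pearson(height, weight)

# Under the "no relationship" assumption, any pairing of heights with
# weights is equally likely, so shuffle one variable and count how often
# a correlation at least this strong shows up by chance.
random.seed(1)
trials = 10_000
count = 0
shuffled = weight[:]
for _ in range(trials):
    random.shuffle(shuffled)
    if abs(pearson(height, shuffled)) >= abs(r_obs):
        count += 1

p_value = count / trials
print(f"r = {r_obs:.3f}, p ≈ {p_value:.4f}")
```

The observed correlation here is strong, so essentially no shuffled pairing matches it and the estimated p-value is near zero.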

If we change the variables from height and weight to intelligence and parents’ income, the math doesn’t change. We still make the assumption that intelligence and parental income are unrelated and then see how correlated they are. If they were independent of each other, the correlation would probably be close to 0, and a correlation far from 0 would be very unlikely. The likelihood that independent random numbers end up correlated doesn’t depend on whether the variables are biological, sociological, physical, etc.

Scientists are usually careful not to use the word “prove.” Statistical significance levels are somewhat arbitrary, but a p-value of 0.05 means that, if there were no real effect, results at least this extreme would occur by chance only 5% of the time. There is always the chance of Type I and Type II error (false positives/negatives), which is why replication is important.

>For example we might look at correlation and get a significant positive correlation between two variables. Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

I don’t think anyone is answering this, but it’s because you’re assuming that the averages of the random variables follow a normal distribution. This comes from the Central Limit Theorem, which says that as your sample gets arbitrarily large, the probability distribution of the sample average approaches a normal distribution, even if the variables themselves aren’t normally distributed.

So, for example, flipping a coin is 50% heads or 50% tails. A single coin flip isn’t normally distributed at all; it follows a Bernoulli distribution (and the count of heads over several flips follows a binomial distribution). However, if you flip 10,000 coins and count how many heads you get, you will find that the count follows a bell curve: 50% of the time you’ll get 5,000 heads or fewer, about 68% of the time you’ll get a value within one standard deviation of 5,000, etc. And this works with ANYTHING from ANY probability distribution as long as n gets large enough. So for real-life problems we don’t actually need to know the probability distribution of single events, as long as we take a large enough sample size! (Well, in general; I know there are normality tests you can do, but I’m not getting into that.)
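A quick simulation illustrates this; the exact counts vary with the random seed, but the head totals cluster in a bell shape around 5,000 with standard deviation sqrt(10000 × 0.5 × 0.5) = 50:

```python
import random

random.seed(0)

# Flip 10,000 fair coins and count the heads; repeat 500 times to see how
# the totals spread around the expected 5,000.
def count_heads(n_flips):
    return sum(random.getrandbits(1) for _ in range(n_flips))

totals = [count_heads(10_000) for _ in range(500)]

mean = sum(totals) / len(totals)
sd = (sum((t - mean) ** 2 for t in totals) / len(totals)) ** 0.5

# Theory predicts sd = 50 and ~68% of runs within one sd of 5,000.
within_one_sd = sum(abs(t - 5_000) <= 50 for t in totals) / len(totals)
print(f"mean ≈ {mean:.0f}, sd ≈ {sd:.1f}, within 1 sd: {within_one_sd:.0%}")
```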

So what you’re doing with a p-test is taking a sample, assuming its average is (approximately) normally distributed, and comparing it to the theoretical normal distribution implied by the null hypothesis, with its given mean and standard deviation. The p-value is the probability of seeing a sample average at least this far from the hypothesized mean if that null distribution really were the true one.

How I learned to understand significance is that P-value = probability value.

It’s the probability that the effect your stats test is measuring arose from random variation instead of a real cause.

In biology, variation is the name of the game, so it’s important to know the odds that what you’re seeing is just variation.

We accept under 0.05 as significant because a 5% chance of a fluke versus 95% not was considered acceptable. But no matter how small the p-value is, there is always some chance the result is random and not real.

>Given that variables can be literally anything in question, how does doing a few statistical calculations determine it is significant?

> I always thought there must be more nuance as the actual variables can be so many different things.

I think this is the problem that is missing from many answers.

You always have to MODEL the variables somehow. That is, make some mathematical assumptions about how those variables behave. This is what allows you to analyze the relationship between them. If your model is wrong, the work is useless.

If you’re an undergraduate, they probably just skip this part, because covering it would require teaching you real math and statistics. Instead, they give you some formulas based on a model they already had in mind. And it’s not just undergraduates; even actual researchers can be amazingly bad at statistics. It’s hard to tell how many researchers are bad at statistics and how many are downright fraudulent, but it’s a problem in science.

So it’s important to make a model generic enough that it’s unlikely to be wrong, but specific enough that you can analyze it. On one end, you have Fisher-style tests, which are mathematically simple, with maths that has been understood since the time of Gauss, but which are very simplistic and require you to make a lot of assumptions. On the other end, you have all these newfangled deep learning networks, which nobody quite knows how they work, but which are a lot more generic.

Once you have a model, you can mathematically analyze it to see what kind of data you would expect. If you haven’t done this, it’s not possible to quantitatively say whether something is significant. In fact, it is quite possible to get amazingly useless results because the analysis was done poorly. This is a huge problem in research, especially in fields like nutritional science, psychology, and sociology.

As others have noted, “significance” in the sense of p-values just refers to how often you would observe some data pattern from random chance alone, given the total possibilities of the system you are observing. It doesn’t prove anything in and of itself; all a p-value can do is say that there is a pattern that seems to *contradict* the null hypothesis (i.e. that your data arose from random chance alone).

It is important to set this up and use it correctly: if you are testing lots of things at once, you need to account for that properly in the stats test you use, e.g. applying an F-test rather than multiple t-tests, or using a Bonferroni correction, where you divide your significance threshold by the number m of hypotheses you are testing (equivalently, multiply each observed p-value by m). Otherwise you are just *cherry-picking*, i.e. throwing stuff at the wall to see what sticks, without really explaining or learning anything about stickiness.
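A minimal sketch of the Bonferroni arithmetic, with hypothetical p-values:

```python
# Bonferroni correction: to keep the overall false-positive rate near
# alpha while running m tests, require each individual p-value to clear
# alpha / m instead of alpha.
alpha, m = 0.05, 20
threshold = alpha / m  # = 0.0025

p_values = [0.001, 0.004, 0.012, 0.030, 0.200]  # hypothetical results
significant = [p for p in p_values if p < threshold]
print(threshold, significant)  # only 0.001 survives the correction
```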

Separately…

It’s worth noting that the mechanics of specifying a null hypothesis, “significance”, and the meaning of p-values under a “frequentist” paradigm are not intuitive to most humans at all.

The math has historically been trickier to calculate, but with modern computing, the “Bayesian stats” paradigm is easier to understand. It is simply about the level of confidence I have that something is true or not, and I can use that paradigm to synthesize evidence from lots of different previous study designs and setups, as long as I have accurate figures and confidence in the random sampling of each.

In real life (and science) we use prior knowledge and theory all the time.

If I am walking along a dark street at night and I see a jewelry store that has a broken window and merchandise strewn about, I can be confident enough that a robbery has taken place to call the police. I can triangulate from other knowledge without needing to have randomly seen the exact same scene many times before.

The root problem is that “significant” has a common-language definition and a stats definition that aren’t the same. In your post, you’re using it like the common usage meaning “substantial” or “important”.

In stats-language, “significant” just means “less than ___% chance this result is a random coincidence”, where ___% is *whatever P value threshold you choose* to use.

If you decide to use p = 0.05 as your cutoff, you’re saying you’ll accept up to a 5% chance that the result is a random coincidence, i.e. that you’d see data this extreme even if nothing real were going on. So if you get p < 0.05, it means “less than a 5% chance this is a random coincidence”.

But you could just as easily choose to use P = 0.4 as your cutoff. Then let’s say you do the calculation and find some effect has P = 0.3. That P is smaller than your chosen “threshold of significance”, so by definition that effect is “significant” (by the stats definition) *even though you just showed it has a 30% chance of being a random coincidence*.


>How do these tests actually know the data is significant

They don’t – that’s a misuse of the common meaning of “significant”. **Calculating p only tells you the probability of seeing an effect this large by random coincidence (if nothing real were going on), and you decide (by your choice of critical p, often 0.05 by convention) at what probability of false alarm you’re willing to call the result “significant”.**