Statistical Significance vs. non-significant


What exactly does it mean when a result is statistically significant vs. insignificant? When we compare, for example, a t-stat and the critical t-value, I know we either reject or fail to reject the null hypothesis based on whether the t-stat is less than or greater than the t-value. What exactly does it mean when the t-stat is greater than the critical t-value? What even is the “t-stat” and “critical t-value” in layman terms?

After doing enough problems, I’m sure I’ll get it, but I don’t like _not_ being able to explain this to myself simply – which indicates that I haven’t understood it well enough. Can someone please dumb all of this down for me and truly explain it to me like I’m a child?


Nonsignificant – could just be noise. X and Y can’t be said to be different.

Statistically significant – can be said to be different. Might or might not be a big difference, but X and Y are not the same.

Probably better explanations out there.

How likely is it that your results are from dumb luck alone? If you’re flipping a coin, it’s normal to get heads a few times in a row. The more flips in a row that come up heads, the less likely it is to be luck, and the more statistically significant the result.

In its most basic form, you have some results for something (say, people’s ages and whether they voted for Biden or Trump). A trend you might see in those results is that older people were more likely to vote for Trump and younger people were more likely to vote for Biden. So you make a null hypothesis to test this. The null hypothesis always assumes there’s no real trend, so here it might be “age has no effect on how a person votes”.

If you test this statistically, the result is statistically significant if the trend is too strong to be plausibly explained by random chance alone. In that case, you can reject the null hypothesis and say there’s evidence of a real link between age and voting pattern.

The t-test is just a way to measure how different your results are to what you would expect if there was no real trend at all. If the t-value is above a critical value then the difference between your measured results and what the results would look like if things were random is big enough to be confident that your trend is statistically significant.

Statistically, “significant” means that the data supports rejecting the null hypothesis and that the chances are very low that it’s a coincidence. “Insignificant” means either that the data is consistent with the null hypothesis, or that any effect is weak enough that it might be a coincidence.

Since coin-flipping is the go-to situation for probability, consider the following situation: You’re told that a coin is a fair coin (null hypothesis), but you’re not allowed to hold it or look at it. Someone repeatedly flips it for you, and the result is heads every time. At what point do you decide that you were lied to, and the coin actually has heads on both sides? 5 flips? 10? 100?

IRL you could, of course, look at the coin to see if there’s a tails side. But if, for example, you wind up with 99 heads and 1 tails, you can be pretty sure the coin is biased, whereas 5 heads and 1 tails wouldn’t do it because that could easily happen with a fair coin. Somewhere in between is the line between significance and insignificance, and that (I think) is what the critical t-value represents.
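To put rough numbers on the coin example, here is a minimal sketch that computes how likely a fair coin is to produce at least a given number of heads. The function name `prob_at_least` is made up for illustration, and it uses an exact binomial calculation rather than a t-test, which is the more natural tool for coin flips.

```python
from math import comb

def prob_at_least(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Chance of seeing at least `heads` heads in `flips` flips of a coin
    that lands heads with probability `p_heads` (0.5 = fair coin)."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# 5 heads and 1 tails: happens fairly often with a fair coin.
print(prob_at_least(5, 6))
# 99 heads and 1 tails: essentially never happens with a fair coin.
print(prob_at_least(99, 100))
```

The first probability is about 11%, well above the usual 5% cutoff, while the second is astronomically small, which matches the intuition that 99 heads out of 100 is convincing and 5 out of 6 is not.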

I think this is what you’re getting at, though I may be answering the wrong thing: basically, if a result is found to be significant, it means the relationship between X and Y is very unlikely to be a fluke of random chance. (On its own it doesn’t show that X causes Y; that takes more than a significance test.) The t-stat is just a way of boiling down the relationship between x and y across all of your hundreds or thousands of data points into one number. The math to get there can look complicated since it’s taking so many x/y pairs and reducing them all down to one number. So now we have a number representing the relationship, but how do we tell if it’s a random relationship or a real one? We do that by showing that it’s unlikely to be random, rather than proving directly that it’s real. The critical value is then just the cutoff on a bell curve (of the t-stats you’d get if there were no real relationship), chosen so that only 5% of the area lies past it, out in the tails. If your t-stat (the number that represents the relationship of your two variables) is further out than the critical value (the number that fences off 95% of what pure chance would produce), then you can say that the relationship is probably NOT random.

I hope this helped, that’s how I think of it at least.

It’s a really bad term used colloquially.

Okay, in statistics we have a p-value. This is basically the chance of getting results at least as strong as yours even if your hypothesis were completely wrong (that is, if the null hypothesis were true). Normally .05, or 5%, is the cutoff: a p-value below it is used to reject the null hypothesis and claim “statistically significant.”

Unfortunately, the term is abused to say “We were right!” No, it doesn’t mean that. It just means that results like yours would turn up less than 5% of the time if the null hypothesis were true. The cutoff can be set at any number, and studies do go lower than .05. There are many more things about a study that can give you confidence, or take it away, than the p-value.

In fact, there’s a thing called p-hacking, which means massaging the data to get that P value down below .05 although your hypothesis really is rubbish, crafted to get the answer you want.
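One common form of p-hacking is simply running many tests and reporting whichever one happens to dip below .05. As a rough sketch (assuming the tests are independent and there is no real effect in any of them; the function name is made up for illustration), the chance of at least one false “significant” result grows quickly with the number of tests:

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests comes out
    'significant' at level `alpha` when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests

for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

With 20 tests the chance is roughly 64%, so finding one p-value under .05 among many attempts says very little by itself.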

tl;dr: the critical value is a benchmark, and the test statistic is the number you compare to the benchmark. the test statistic is calculated from the data you observed, and you’re assuming the data comes from some assumed distribution. you’re comparing the test statistic to the benchmark to see if there is evidence that your assumption is false. “statistical significance” is then related to the benchmark you choose.


To answer the questions of describing these concepts in layman’s terms, assuming you’ve done a few introductory lessons already:

* The “t-stat” is a number you get by plugging your data into a formula. **If** your null hypothesis is true, that number follows the **t-distribution**.
* The “critical t-value” is your benchmark: the value from the **t-distribution** that a t-stat would only rarely go beyond (with a probability you choose, such as 5%) **if** your null hypothesis were true.
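A small sketch of what these two numbers look like in practice, for a one-sample t-test. The measurements and the hypothesized mean of 5.0 are invented for illustration, and the critical value 2.262 is the standard table value for a two-sided test at the 5% level with 9 degrees of freedom (10 data points minus 1); a different sample size or significance level would need a different table entry.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t_stat(data: list[float], null_mean: float) -> float:
    """t-stat: how many standard errors the sample mean is from the
    mean claimed by the null hypothesis."""
    standard_error = stdev(data) / sqrt(len(data))
    return (mean(data) - null_mean) / standard_error

# Hypothetical measurements; the null hypothesis says the true mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.9, 5.4, 5.5]
t_stat = one_sample_t_stat(sample, null_mean=5.0)

# Table value: two-sided, 5% significance level, 9 degrees of freedom.
CRITICAL_T = 2.262

reject_null = abs(t_stat) > CRITICAL_T
print(round(t_stat, 2), reject_null)
```

Here the t-stat lands well past the benchmark, so under these made-up numbers you would reject the null hypothesis that the true mean is 5.0.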

If you reject the null hypothesis this is what you’re saying:

> The probability of observing the data, if we assume the data follows the t distribution, is *so low* that we feel our assumption is false. We have evidence that the data follows some *other* distribution.


Analogy/example to hopefully clear up the *methodology* of this stuff: Let’s say you’re a basketball coach, and you’re scouting a player. Someone told you this particular player is a great ball-handler, so that’s your initial assumption when watching. Now you have to choose a benchmark to decide whether that assumption is wrong. Let’s say our benchmark is “losing the ball 3 times”. If he loses the ball more than that, say 4 times, your assumption was wrong: your source lied to you and this guy is not a great ball-handler.

Why 3? No profound reason, really. You think it’s a fair number. This could be 2 or 4 or 5. It’s whatever you want it to be (this is analogous to choosing your “significance level”). You probably wouldn’t choose 1, since the player might just be unlucky on one play. But you feel 3 is a fair number so you don’t write off the player too soon. If by the end of the game the player only loses the ball once, that would be statistically insignificant; you still think he’s a great ball-handler. If he loses the ball 10 times, yeah, that’s some *statistically significant* evidence that he’s not as great as you thought.

the above example was off the cuff and just to provide a different context for these basic concepts that maybe you wouldn’t have seen yet.


Now going past introductory statistics classes, it’s a recent development to move away from *having* to reject or fail to reject the null hypothesis due to rampant abuse and misunderstanding.

You don’t *have* to reject or fail to reject anything. Practically, is there a difference between a p-value of 0.0499999 and 0.0500001? I’d argue, in most cases, that’s a flat-out no.


Statistical significance means that **if the null hypothesis were true**, the chances of observing a difference as (or more) extreme as what we’re seeing are so low that we’re pretty sure we can correctly reject the null.

Say you’re testing a die to see if it’s fair or loaded. The null hypothesis is that it’s fair, so you would expect to see the numbers 1-6 show up in roughly equal proportions across multiple rolls.

As an extreme example, say you roll it 100 times and get a 6 every time. Obviously the chances of rolling the same number 100/100 times on a fair die are ridiculously low, so we can confidently reject the null hypothesis that this is a fair die.

But what if you couldn’t roll the die 100 times? What if you could only roll it 10 times? This is analogous to doing a study with limited resources and only being able to recruit a sample size of so many people.

If you rolled a six 10/10 times, you could still be pretty sure it’s loaded, but not as sure as if it were 100/100 times. If you only had 5 rolls and rolled a six every time, you might suspect it’s loaded, but you probably wouldn’t bet your life on it. At two rolls, it’s impossible to draw a conclusion, since you can very easily roll the same number twice in a row by chance.
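The shrinking odds in this example are easy to check directly. The sketch below (function names are made up for illustration) gives the exact chance, on a fair die, of rolling a six every time and of rolling the same face every time, whatever that face is.

```python
from fractions import Fraction

def prob_all_sixes(rolls: int, sides: int = 6) -> Fraction:
    """Chance that every one of `rolls` rolls of a fair die is a six."""
    return Fraction(1, sides) ** rolls

def prob_all_same_face(rolls: int, sides: int = 6) -> Fraction:
    """Chance that all `rolls` rolls show the same face (any face):
    the first roll is free, each later roll must match it."""
    return Fraction(1, sides) ** (rolls - 1)

for n in (2, 5, 10):
    print(n, float(prob_all_sixes(n)), float(prob_all_same_face(n)))
```

Rolling the same face twice in a row has a 1 in 6 chance, which is why two rolls prove nothing, while ten sixes in a row is roughly a 1 in 60 million event.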

The critical value is the cutoff that goes with your significance threshold: a mathematical way of establishing when you would feel confident rejecting the null, given the number of rolls you have. You then compare the outcome you observed against it.

For any outcome we observe over a number of rolls, we can measure how far it is from what a fair die would give, since we know the mean and SD of a fair die (mean = 3.5 and SD ≈ 1.7). That standardized distance **is the test statistic/t-stat**, and the probability of getting a result at least that far out on a fair die is the p-value.

We can then set our **significance threshold**: the probability below which an outcome is considered too unlikely to be chance, so that we would be confident in rejecting the null. In science, this is most often 5% (the flip side of the familiar 95% confidence level), and the **critical value** is the test-statistic cutoff that corresponds to it. It’s an arbitrary value, but is commonly used by convention. So in this case, we can say that if the outcome we observed had less than a 5% probability of occurring by chance on a fair die, it’s statistically significant, and we’ll reject the null.
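As a rough sketch of the numbers behind the die example, the code below computes the mean and SD of a fair six-sided die and a standardized distance between an observed average and the fair-die average. Because the SD is known here rather than estimated from the data, this is strictly a z-style statistic; the function name is made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

FACES = [1, 2, 3, 4, 5, 6]
FAIR_MEAN = mean(FACES)    # 3.5
FAIR_SD = pstdev(FACES)    # population SD of one roll, about 1.708

def standardized_distance(observed_mean: float, rolls: int) -> float:
    """How many standard errors the observed average of `rolls` rolls
    sits from the fair-die average."""
    standard_error = FAIR_SD / sqrt(rolls)
    return (observed_mean - FAIR_MEAN) / standard_error

print(FAIR_MEAN, round(FAIR_SD, 3))
# Ten rolls that all came up six (average of 6.0):
print(round(standardized_distance(6.0, 10), 2))
```

The larger this number is in absolute value, the harder the observed rolls are to explain with a fair die.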

As mentioned, our ability to draw these conclusions depends on the sample size. For example, if you only had two rolls, any possible outcome would lie within that 95% range. Even the most extreme results you could get (two 1s or two 6s) have a combined 2 in 36 chance, about 5.6%, of happening on a fair die, which is still above the 5% cutoff. Therefore there is no possibility of reaching statistical significance with a sample that small. As you increase the sample size, you increase the power of the study to reach statistical significance.
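The two-roll case is small enough to check by brute force. This sketch lists every possible pair of rolls on a fair die and counts how many have an average at least as far from 3.5 as the one observed; the function name is made up for illustration, and it uses an exact two-sided calculation rather than a t-test.

```python
from fractions import Fraction
from itertools import product

def two_sided_p_for_mean(observed_mean: Fraction, rolls: int = 2,
                         sides: int = 6) -> Fraction:
    """Exact chance, on a fair die, that the average of `rolls` rolls is
    at least as far from the expected average as `observed_mean` is."""
    expected = Fraction(sides + 1, 2)
    observed_gap = abs(observed_mean - expected)
    outcomes = list(product(range(1, sides + 1), repeat=rolls))
    as_extreme = sum(
        1 for o in outcomes
        if abs(Fraction(sum(o), rolls) - expected) >= observed_gap
    )
    return Fraction(as_extreme, len(outcomes))

# The most extreme thing two rolls can do: two sixes (average of 6).
most_extreme_p = two_sided_p_for_mean(Fraction(6))
print(most_extreme_p, float(most_extreme_p))
```

Since even the most extreme outcome stays above 0.05, no result from two rolls can be called statistically significant at the 5% level.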