The P value is, roughly, the probability that a result like yours would show up from chance alone rather than from the effect of whatever you are studying. The lower it is, the more confident you can be that your study has produced something meaningful.
P-hacking is the process of selecting some data and discarding others to get your P value as low as possible (generally to 0.05 or below).
There’s an interesting tool at https://projects.fivethirtyeight.com/p-hacking/ to show how it works. You can “prove” that either party is statistically better at handling the economy by selecting which variables you use.
“p-hacking” is a term used to describe intentional misuse of data to try and push a certain point or narrative.
Basically, say you collect 1,000 data points in a study. The first 500 don’t really support your idea, but the second 500 do. Well, if you want to publish an article making your point, what if you simply ignore the first 500 data points, never tell anyone about them, and only include the second 500? Boom, now it looks like your idea is supported.
That is a really simple example of manipulating your data: overall the data doesn’t tell you anything significant, but you can make it look like it does.
And there are like half a dozen other ways to kind of achieve this.
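To make that concrete, here is a minimal simulation (my own sketch, with arbitrary numbers, not something from the answer above): each fake “study” collects 1,000 data points with no real effect at all, and we compare the honest analysis (test everything) against the trick above (report whichever half of the data happens to look better).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 2000
full_hits, half_hits = 0, 0

for _ in range(n_studies):
    # 1,000 data points with no real effect at all (true mean = 0).
    data = rng.normal(0.0, 1.0, size=1000)

    # Honest analysis: one test on the whole sample.
    if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
        full_hits += 1

    # Selective analysis: report whichever half looks "significant".
    p_first = stats.ttest_1samp(data[:500], 0.0).pvalue
    p_second = stats.ttest_1samp(data[500:], 0.0).pvalue
    if min(p_first, p_second) < 0.05:
        half_hits += 1

print(f"False positives, full sample:  {full_hits / n_studies:.1%}")   # roughly 5%
print(f"False positives, pick-a-half:  {half_hits / n_studies:.1%}")   # roughly 10%
```

Just being allowed to pick between two halves already doubles the rate of fake “findings”; with more ways to slice the data, it only gets worse.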
Scientific findings are based on data, typically a small sample of all the data it would be possible to collect. The eternal anxiety in science is that if you had, by chance, collected a slightly different dataset, you might have found a different result, or no interesting result at all.
In rides Statistics to the rescue. It provides a way of saying how likely that scenario is. In particular, it can provide a “p value”, which is the probability that a result as big as yours would show up simply by chance. By convention, a result is “statistically significant” if this probability is less than 1 in 20.
P-hacking is the practice of abusing this convention. The simplest way would be to run the same experiment 20 times (or more). Even if the hypothesis of the experiment is false, this gives you a decent chance that one of the experiments will turn up a statistically significant result simply by chance. More subtly, you can test 20 different hypotheses. Odds are at least one will be statistically significant, and you can make that the focus of the paper.
In an era of big data and fast computers, p-hacking has become common and easy. There’s always a different subset of the data, outcome, or covariate specification to test, and it’s quick to code and run those statistical tests all at once. A red flag for p-hacking is when a paper has results that are “spotty” in that only certain specifications see (often contradictory) effects for no good reason. e.g. “Chemical X was found to induce hair growth in lactating mothers but hair loss in men over 60.” A good defense against p-hacking is asking researchers to “pre-register” what hypotheses they will test and why *before* they get the data, then only accepting papers that adhere to those pre-determined plans.
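To put a number on the “test 20 hypotheses” scenario, here is a small simulation (my own sketch, not from the answer above): each fake study tests 20 unrelated hypotheses on pure noise, and we count how often at least one of them clears the p < 0.05 bar.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_hypotheses = 2000, 20
at_least_one = 0

for _ in range(n_studies):
    # Each "hypothesis" compares two groups that truly have identical means.
    pvals = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_hypotheses)
    ]
    if min(pvals) < 0.05:
        at_least_one += 1

# Theory agrees: 1 - 0.95**20 is about 0.64, so roughly 2 studies in 3
# can report "a finding" even though nothing real is going on.
print(f"Studies with at least one 'significant' result: {at_least_one / n_studies:.1%}")
```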
It’s like shooting at a target with a shotgun with hundreds of tiny pellets and showing people the one pellet that hit the target, in spite of your terrible aim. P hacking would be paying attention to how likely it was for that specific pellet to hit the target, rather than paying attention to how likely any pellet was to hit the target.
It’s a way of interpreting data that is more likely to find spurious correlations than real causes and effects.
Basically, p-hacking is the science equivalent of going to a firing range and covering the far wall completely with target papers.
You then proceed to fire every round you have blindly down range.
Once all ammo is expended, you go pull down all of the targets, discard anything which has a hole NOT on the bullseye, and collect all the targets where you did hit the bullseye.
Now you present this collection of well-shot targets to somebody and ask to be recognized as a sharpshooter.
A p-hacked result does show actual collected data. But the data was initially collected among one target population… and then it was narrowed down to be from a completely different target population so that the data does show something statistically relevant, when under the initial plan for the study the results were not conclusive.
Like maybe I was thinking that all high school students exhibit behavior X, but when I took data on 3 million high school students across the nation, it wasn’t showing up. So I start playing with the demographic data also provided. Did all Hispanic students exhibit X? How about all students on free and reduced lunch? How about all females? Maybe all 12-14 year old Inuits from a single-parent household? Ooh, I am in luck: there are actually only two such people in my data set, and they happen to both exhibit X! Now I proclaim that my “…study of 3 million high school students shows that, for certain populations, behavior X is present…”
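Here is a rough sketch of that subgroup fishing (the demographic columns and the sample size are made up by me and scaled way down from 3 million): behavior X is assigned purely at random, yet searching enough demographic slices usually turns up at least one that looks “significant”.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_students = 10_000  # scaled down from 3 million for speed

# Behavior X is pure chance: every student has the same 20% rate.
behavior_x = rng.random(n_students) < 0.20

# Made-up demographic columns, assigned at random (no real link to X).
cols = {
    "sex": rng.choice(["F", "M"], n_students),
    "lunch": rng.choice(["free", "paid"], n_students),
    "age_band": rng.choice(["12-14", "15-16", "17-18"], n_students),
    "household": rng.choice(["single", "two-parent"], n_students),
}

# Fish through every single-column and two-column subgroup.
results = []
combos = [(c,) for c in cols] + list(itertools.combinations(cols, 2))
for combo in combos:
    labels = np.array(["|".join(row) for row in zip(*(cols[c] for c in combo))])
    for group in np.unique(labels):
        in_group = labels == group
        # 2x2 table: behavior X yes/no, inside vs outside the subgroup.
        table = [
            [np.sum(behavior_x & in_group), np.sum(behavior_x & ~in_group)],
            [np.sum(~behavior_x & in_group), np.sum(~behavior_x & ~in_group)],
        ]
        chi2, p, dof, _ = stats.chi2_contingency(table)
        results.append((p, combo, group))

# With dozens of slices, the smallest p is often "significant" by luck alone.
for p, combo, group in sorted(results)[:3]:
    print(f"{'+'.join(combo)} = {group}: p = {p:.3f}")
```

The finer you slice (down to that two-person subgroup), the more guaranteed it is that something will pop out by accident.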
Imagine you have a virus that infects 50% of the test mouse population. The first important thing to understand is that there is always variance. With this virus, if you have 10 mice, you might expect exactly 5 to be infected every time, but in reality some days the mice are lucky and only 4 are infected, or occasionally just 3. Other days 6, or more rarely 7, will catch it.
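If you want to see that spread, here is a tiny simulation (mine, not the commenter’s): lots of cages of 10 mice, where each mouse independently has a 50% chance of catching the virus.

```python
import numpy as np

rng = np.random.default_rng(3)

# 10,000 cages of 10 mice; each mouse has a 50% chance of catching the virus.
infected_per_cage = rng.binomial(n=10, p=0.5, size=10_000)

# How often does each outcome (0 to 10 infected) actually show up?
for k in range(11):
    share = np.mean(infected_per_cage == k)
    print(f"{k:2d} infected: {share:6.1%}  {'#' * int(share * 100)}")
```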
Now let’s assume you have a cure that prevents the infection. But the cure is not 100% effective; it only saves *some* of the mice. The same principle applies. Let’s say your cure saves 2 of the 5 infected mice, so now the virus infects 3 out of 10 on average, but sometimes it’s exactly 3 and sometimes just 2 or 4. The variance applies here too.
Now you try out your cure: you treat 10 mice and leave 10 untreated. Let’s say the treated group has 3 infected and the untreated group has 6 infected.
Now the question is the following: how sure can you be that the cure *really* works? Even without the cure, sometimes you get only 3 infected mice. Sometimes you just get one lucky cage and one unlucky cage.
The answer is that you never really know for sure that the cure worked. However, you have mathematical tools to figure out how *unlikely* it is to get a cage of 3 versus a cage of 6 by luck alone. Let’s say your calculations tell you that if you did the experiment 100 times and your cure did nothing, you would get this 3-vs-6 result only 5 times. That’s your p value: the 5 out of 100, i.e. 1 in 20, or whatever the number comes out to be.
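For the curious, this is roughly how such a calculation looks in practice. I’m using Fisher’s exact test as one standard choice (the answer doesn’t name a specific test), and note that with only 10 mice per cage the toy 3-vs-6 split actually comes out far above the 5-in-100 level; the “5 times out of 100” in the answer is purely for illustration.

```python
from scipy import stats

# 2x2 table: rows = treated / untreated cage, columns = infected / healthy.
table = [[3, 7],   # cure-treated cage: 3 infected out of 10
         [6, 4]]   # untreated cage:    6 infected out of 10

# Fisher's exact test asks: if the cure did nothing, how often would a
# split at least this lopsided appear just by luck?
odds_ratio, p_value = stats.fisher_exact(table)
print(f"p = {p_value:.2f}")  # about 0.37 with these toy numbers
```

That is also why real studies use a lot more than 10 animals per group: with tiny samples, a modest cure is nearly impossible to distinguish from luck.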
Long ago, scientists agreed that 5% is an acceptable risk. Which means that although you don’t know for sure that your cure works, you accept the risk. Maybe you had a very lucky cage of mice and a very unlucky cage, but you dismiss that option because such a setup is very unlikely. So unlikely that your cure most probably works.
Now, as you can see, this also means that out of 100 cures that do nothing, about 5 will still pass the test. We just don’t know which ones they are, because mice that were simply lucky look exactly like a working cure.
So what is p-hacking? Let’s assume you are a scientist, but now you have 100 candidate cures for the virus. You want to test which ones work, so you take 10 mice for each; in some cages only 3 mice get infected, in others 4 or 5, etc. Let’s say 5 of the 100 candidates end up looking like they worked.
But did they really? As you’ve seen, mice are just lucky sometimes. In fact, if you used pure water as the “cure”, maybe 5 out of 100 cages would *look like* a cure that works.
The problem is that a lot of scientists don’t understand that running a mass experiment and keeping the 5% that “worked” is a problem. They may genuinely believe they have captured 5 good cures out of 100 and not realize that this is statistically invalid: run 100 experiments and a random 5% or so will always seem to work. That’s why, when you do a mass experiment, you don’t calculate a p-value for each test in isolation; you have to account for all of them together and do different math (a correction for multiple comparisons).
If someone skips that combined calculation and doesn’t tell you about the 95 failed experiments, only the 5 successful-looking ones, that is p-hacking. And as you can see, it is very hard to tell when someone has kept something secret, so you never really know whether a result is genuine or hacked.
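Here is a quick sketch of that mass screen (my own simulation, with arbitrary group sizes): 100 candidate cures that all do nothing, each tested on its own group of mice, first at a raw p < 0.05 and then with a Bonferroni correction, which is one example of the “different math” mentioned above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_cures, mice_per_cure = 100, 100

# Every one of the 100 candidate "cures" is useless: infection stays at 50%.
pvals = []
for _ in range(n_cures):
    infected = rng.binomial(mice_per_cure, 0.5)
    # Does this group's infection rate differ from the usual 50%?
    pvals.append(stats.binomtest(infected, mice_per_cure, 0.5).pvalue)
pvals = np.array(pvals)

# Screening them one by one, as if each test stood alone...
print(f"Cures that 'work' at raw p < 0.05:            {np.sum(pvals < 0.05)}")        # usually a few

# ...versus the combined math: a Bonferroni correction for 100 tests at once.
print(f"Cures that survive Bonferroni (p < 0.05/100): {np.sum(pvals < 0.05 / n_cures)}")  # usually 0
```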
In statistical analysis we work with two hypotheses and a probability value. The hypotheses are known as the null and the alternative hypothesis. The null hypothesis says there is no pattern: any difference we see is just noise. The alternative says there is a real pattern. The p-value is the probability of getting data at least as extreme as what we recorded if the null hypothesis were true. In many cases the data are assumed to follow a bell curve, meaning very low and very high values occur less frequently, and we draw a line in the tails of that curve saying “if the data land beyond this line too rarely to be noise, we’ll call it a pattern.”
Typically we set that threshold at .05. If the p-value comes in below .05, data this extreme would show up less than 5% of the time by chance alone, so we reject the null hypothesis and say there may be a pattern. If the p-value is above .05, there’s too much chance it’s just noise to call it a pattern.
P-hacking is setting an inappropriately lenient threshold, or changing the threshold after the fact, so that you can reject your null hypothesis. Or it’s stopping (or extending) data collection at an opportune moment because your p-value is starting to approach the threshold.
It’s academically dishonest and lets you make claims about patterns that may not be present.
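That last move, stopping or extending data collection based on where the p-value is heading, is often called optional stopping. Here is a small simulation of it (my own sketch, with arbitrary sample sizes): the data are pure noise, but peeking at the p-value after every few observations and stopping the moment it dips below .05 inflates the false-positive rate well past 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_studies, max_n, check_every = 1000, 200, 10
fixed_hits, peeking_hits = 0, 0

for _ in range(n_studies):
    # No real effect: the data are pure noise around a true mean of 0.
    data = rng.normal(0.0, 1.0, size=max_n)

    # Honest plan: decide on 200 observations up front, test once at the end.
    if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
        fixed_hits += 1

    # Peeking: test after every 10 observations, stop as soon as p < 0.05.
    for n in range(check_every, max_n + 1, check_every):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
            peeking_hits += 1
            break

print(f"False positives with a fixed sample size: {fixed_hits / n_studies:.1%}")   # roughly 5%
print(f"False positives with peeking/early stop:  {peeking_hits / n_studies:.1%}")  # several times higher
```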
Lots of people are giving you kind of decent answers but are missing some nuance. Let me help. I got an A in stats for my bio degree.
* A p-value is **one tool of many** to quantify the “usefulness” of scientific data.
* In stats, every way of teasing the data around has tradeoffs and assumptions. Certain kinds of data do funny things to any sort of standard formula, and in some cases can totally break it. So it is important to note that statisticians **select tools carefully** based on the data they expect to see (and in some cases, based on the data they actually got). That said, a p-value is like a screwdriver or ratchet more than a saw. You will end up using that tool on lots of jobs. Another very, very common tool you should be familiar with is the **confidence interval**, which is also sometimes used to express whether a study is significant or not.
* p-values are rated against an *alpha* value which is standardized at 0.05. This is why you **constantly** see “(p<0.05)”. But in reality, this is mostly an arbitrary choice of the scientific community. We have apparently collectively decided that a 4.9% chance of error is acceptable and a 5.1% chance is not. In some fields, the alpha is 0.01 because we want to be **really sure**.
* p-values do **not** necessarily track with effect size. Say you design a drug to lower blood pressure and give it to 10,000 people, with another 5,000 as a control. 9,800 of the treated people have their blood pressure lowered by 2 points compared to control, and the rest don’t change at all. Without actually doing the math, that will probably generate a significant result. But do you really care? The effect size is not clinically meaningful. Who would use that drug? (A rough simulation of this idea appears after this list.)
* p-values and confidence intervals only deal with [one of the two kinds of sampling error](https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/type-i-and-type-ii-error/). Good scientists also do *power analysis* in many cases when choosing their sample sizes, meaning that they are thinking ahead about the relative risk of type I versus type II errors.
* p-values don’t deal with other problems in methodology. If I take bad measurements, if my measurements don’t actually mean what I think they mean, if I do some math wrong, etc., those problems totally bypass the p-value. A p-value is calculated with the assumption that we’re doing everything else correctly and above board.
* In general, it is not only OK but actually encouraged to carefully plan a study by doing some analysis before even collecting data. p-hacking is generally done **after** data are collected. One way to avoid this is by telling everyone (or a trusted authority monitoring your study) what your plan is ahead of time so that you can’t just change it as soon as your results aren’t to your liking. This is called pre-registration.
[Here](https://blog.minitab.com/en/adventures-in-statistics-2/understanding-hypothesis-tests-significance-levels-alpha-and-p-values-in-statistics) is a good general article about p-values that shows the actual normal distribution curve.
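To make the effect-size bullet concrete, here is a minimal sketch (my own numbers, a slight variation on the bullet’s scenario: the drug shifts everyone’s blood pressure down by about 2 points on average). The p-value comes out microscopically small even though the effect is clinically trivial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical trial: the drug lowers blood pressure by a measly ~2 points on
# average, against a typical person-to-person spread of ~15 points.
treated = rng.normal(loc=118, scale=15, size=10_000)
control = rng.normal(loc=120, scale=15, size=5_000)

t_stat, p = stats.ttest_ind(treated, control)
print(f"p = {p:.1e}")                                         # microscopic: "highly significant"
print(f"difference = {control.mean() - treated.mean():.1f}")  # still only ~2 points
```

With samples this large, even a trivial 2-point shift gets detected with near certainty; a small p-value says the effect is probably real, not that it matters.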
A p-value in science refers, roughly speaking, to how likely it is (the probability) that your conclusion is wrong, based on a set of data. Most scientific fields have generally accepted 5% as the threshold for something worth publishing. Here is an example of a p-value. Let’s say you want to prove to me that a certain coin is unfair at flipping, say 100% heads. You flip it 5 times and get heads every time. Now you have a claim with p<0.05, because the coin could still be fair (50-50 heads to tails) and you were just lucky to get 5 heads (probability 1/2^5 = 0.031). So your chance of being wrong is 3.1% (p=0.031).
P-hacking refers to doing seemingly legit things to bring p under 0.05. Keeping with the example above: say you flip the coin (which is actually fair) five times every day and record it, and then only show me the recording of the day you got 5 heads and claim it’s a 100-0 coin with p<0.05.
You might think this is dumb, but in reality it can be really hard to detect. Sometimes the author even commits it subconsciously. In a lot of cases experiments are conducted in series, with continuous tweaking of the details. Say you’re testing a new drug in mice and it fails, and you think: oh hey, this batch of mice seems a bit skinnier than usual, let’s try again with well-fed mice. It fails again, and you think: oh, maybe I should give the drug with food so they don’t get stressed by force-feeding. And then, and then… and finally one day it works. You report that last run, and in the methods section you detail all the seemingly unsuspicious things you did while administering the drug (keep them fed, don’t stress them, ask for God’s forgiveness, etc.), even though in reality none of these actually matters for the drug’s efficacy. You’re p-hacking.
Note on the math: the coin-flipping example only works out this neatly if we live in a world where a coin can only be 100-0 or 50-50, nothing in between.
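A tiny sketch of that daily routine (my own numbers: one genuinely fair coin, 5 flips a day for a year):

```python
import random

random.seed(7)

# Flip a perfectly fair coin 5 times every day for a year.
all_heads_days = sum(
    all(random.random() < 0.5 for _ in range(5))  # did today give 5 heads in a row?
    for _ in range(365)
)

# Any single day has a 0.5**5 = 0.03125 chance of 5 heads, so over a year you
# expect roughly 365 * 0.03125 ~ 11 days you could cherry-pick and "publish".
print(f"Days with 5 heads out of 365: {all_heads_days}")
```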
Say I get the idea in my head that my local casino’s dice are biased. The 1s are coming up way more often than the 6s, I think. So, I ask them for a die and set out to test my hypothesis.
I roll the die 100 times, and damn, each number came up roughly the same number of times. That can’t be right, I KNOW the die is biased!
I roll the die again 100 times, and damn yet again, the numbers come up roughly the same number of times. That…can’t…be right, I know the die is biased!
I roll another 100 times, another 100 times, another 100 times, and then finally… I roll 100 times and 1 comes up 30 times and 6 only comes up twice! Aha! The dice are biased! I knew it!
The die isn’t actually biased, it’s just that I did so many trial runs that the odds are one of them would make the die look biased. I can publish my results, claiming I only did one trial run, not letting anyone know about all the other thousand trial runs I needed to do in order to get an outcome where the die looks biased.
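Here is a rough simulation of that casino session (my own sketch; I’m using a chi-square goodness-of-fit test as one standard way to check a die, which the comment above doesn’t specify): keep rerunning 100-roll trials of a perfectly fair die until one of them “looks biased”.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

trial = 0
while True:
    trial += 1
    rolls = rng.integers(1, 7, size=100)            # 100 rolls of a perfectly fair die
    counts = np.bincount(rolls, minlength=7)[1:]    # how often each face 1-6 came up
    chi2, p = stats.chisquare(counts)               # test against "all faces equally likely"
    if p < 0.05:                                    # keep rerunning until it "looks biased"
        break

print(f"Trial {trial}: face counts {counts.tolist()}, p = {p:.3f}")
```

On average it takes about 20 trials before a fair die “fails” this test, and those are exactly the trials that quietly vanish from the write-up.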
p-hacking is just this, except instead of doing multiple trial runs, you might look at a dataset in thousands of different ways, trying to correlate thousands of different variables in different ways. Odds are, just by accident, some of the variables are going to be correlated, with a strong enough correlation that you have a publishable result. But the result isn’t real, it’s just statistical noise in the data set that you found as a correlation because you tried to correlate so many different variables; if you looked at fresh data, that correlation would disappear.
To prevent this, before seeing the data, you first commit to which variables you hypothesize there’s a correlation between. Then you look at the data and check the correlation between *only* those variables. The correlation could still be accidental, but it’s far less likely because you’re only checking for one or a few possible correlations, not potentially thousands or millions.