What is p-hacking, how does it work, and what does it mean for science in general?


I’ve been reading about how some studies that we assume provide us with information about the world actually don’t teach us anything because of statistical manipulation that makes them look more relevant than they are. How does this work?



Anonymous

P-hacking occurs when you have so much data, which can be sliced in so many ways, that you can probably find something that seems surprising. Sometimes it’s intentional, but sometimes it’s just overenthusiasm combined with poor documentation of the experiment or study.

The core of the problem is that the “p” in p-hacking stands for a probability, as in “there’s less than a 5% chance of this result happening just by random chance.” But if I have 100 different ways I can analyze the data, I’ll probably find about 5 results that look significant but are really just random noise (the short simulation after the example below shows this in action).

Example:

I think vitamins are good for you, and I want to help everyone by doing science on vitamins. So I get data from people all over the USA:

* what vitamins they take.
* where they live.
* how old they are.
* how healthy they are.

If I look for patterns in the data, I may find patterns like this:

* **AMAZING MEDICAL NEWS: Older women in Southern states who take a daily dose of vitamin C are less likely to have a heart attack than those who don’t. Scientists conclude that vitamin C stops sunburn from causing heart problems in women.**
* (Sad face) Children in cities who take vitamin B daily are more likely to be overweight than those who don’t. I conclude that parents of overweight kids must be trying to help them by giving them extra vitamins. I don’t think that’s an interesting result on vitamins, so you never see that result.
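To make the “slice it 100 ways” point concrete, here is a rough simulation sketch in Python. Everything in it is made up for illustration (the survey variables, the subgroup choices, the simple normal-approximation p-value), and none of it comes from the answer itself; the key feature is that the “health” outcome is pure noise, so any subgroup that comes out “significant” is a false positive:

```python
import numpy as np
from itertools import product
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 5000

# Fake survey data: which vitamin each person takes, where they live, their age group,
# and a made-up "health score" that is pure noise, i.e. truly unrelated to everything else.
vitamins = rng.choice(["A", "B", "C", "D", "E"], size=n)
regions = rng.choice(["North", "South", "East", "West"], size=n)
ages = rng.choice(["child", "adult", "senior"], size=n)
health = rng.normal(size=n)

def p_value(x, y):
    """Two-sided p-value for a difference in means (simple normal approximation)."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Try every vitamin x region x age-group combination and count how many look "significant".
tests = hits = 0
for vit, reg, age in product("ABCDE", ["North", "South", "East", "West"], ["child", "adult", "senior"]):
    in_group = (regions == reg) & (ages == age)
    takers = health[in_group & (vitamins == vit)]
    others = health[in_group & (vitamins != vit)]
    if len(takers) > 20 and len(others) > 20:
        tests += 1
        hits += p_value(takers, others) < 0.05

print(f"{tests} subgroup analyses of pure noise, {hits} 'significant' at p < 0.05")
```

With 60 subgroup analyses and a 5% threshold, roughly three of them should look “significant” by luck alone, which is exactly the kind of headline-ready fluke described above.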

The main way to prevent p-hacking is therefore this:

* Before I analyze the data, maybe even before I get the data, I write down every analysis I plan to do and how I will judge what’s interesting. I commit to including that writeup in every publication I make.
* Before getting the data, I probably didn’t have a theory that there would be an interesting effect for the specific combination of vitamin C – women – South – heart attacks. So I’m not allowed to draw any conclusions about that.
* I may have had a theory about vitamin B and children’s obesity. If I did and included it in the plan, I’m compelled to discuss that. It might be “negative health outcome in urban children, no outcome in rural children. Possible confounders are vitamin prescriptions for obese children.”
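As a toy sketch of that plan-first idea (Python, with hypothetical names that are not from the answer itself): hypotheses written into the plan before the data are seen can be reported as tests, while anything found afterwards is only a lead for a future study.

```python
# Hypothetical pre-registration gate: only analyses written down in advance
# count as confirmatory tests; everything else is exploratory.
ANALYSIS_PLAN = {"vitamin_B_vs_child_obesity"}  # fixed before any data are collected

def classify(finding):
    status = "pre-registered test" if finding in ANALYSIS_PLAN else "exploratory only"
    return f"{finding}: {status}"

print(classify("vitamin_B_vs_child_obesity"))       # planned, so it counts as a test
print(classify("vitamin_C_southern_women_hearts"))  # found by searching, so only a lead
```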

Edit to add:

After finding the older women – vitamin C – South – heart result, I *AM* allowed to redo a study on that, but I have to start from scratch. I can’t use the same data. Possibly I can’t even use the same method of collecting data. But if I come up with a totally different sample of old Southern women and I see the effect *there*, then I do have a valid result. I can even say “I saw this in the data from an earlier study and I thought it was interesting <insert details>, so I came up with this study to test it.”
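That replication rule can be sketched the same way. In this rough, hypothetical illustration (same normal-approximation p-value as the earlier sketch), we dredge one noisy survey for its most “significant” subgroup, then test only that pre-specified subgroup on fresh data:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def p_value(x, y):
    """Two-sided p-value for a difference in means (simple normal approximation)."""
    z = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def survey(n=2000, groups=50):
    """A fake survey: `groups` arbitrary subgroups, outcome is pure noise."""
    return rng.integers(groups, size=n), rng.normal(size=n)

# Step 1: dredge the first survey for the most 'significant' of 50 subgroups.
labels, outcome = survey()
best = min(range(50), key=lambda g: p_value(outcome[labels == g], outcome[labels != g]))
print("exploratory p:", p_value(outcome[labels == best], outcome[labels != best]))

# Step 2: collect new data and test only that one pre-specified subgroup.
labels2, outcome2 = survey()
print("replication p:", p_value(outcome2[labels2 == best], outcome2[labels2 != best]))
# The exploratory p is often below 0.05 just because we picked the extreme of 50 tries;
# the replication p is usually unremarkable, because there was never a real effect.
```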

Anonymous

Let’s say Josh does a study and shows with 90% confidence that the smell of freshly baked brownies during an exam lowers test scores because people get distracted by the smell. 90% confidence sounds good, but it also means there is as much as a 1 in 10 chance that a result like Josh’s would show up purely by chance, even if the smell has no effect at all.

Now let’s say Charlie reads this study and thinks, “I wonder if I can reproduce this study with chocolate chip cookies!” So Charlie tries cookies and finds it doesn’t work. Then Charlie says, “That must mean chocolate chip cookies don’t smell good enough to distract people on tests.” So Charlie keeps rerunning the study with cakes, candy, ice cream, double-stuff Oreos, and more. On his 10th attempt he tries Starbucks cake pops, shows with 90% confidence that their smell lowers test scores, and publishes his results.

Now Emily reads two studies on this topic, thinks “wow, this is a really well-established theory,” and writes a review on how we now have multiple studies showing that the smell of any fresh dessert can distract test takers.

This is p-hacking. Basically, you discard all the attempts that don’t work, but the problem is that each attempt has a 1 in 10 chance of a fluke (you’re only 90% confident), and you tried 10 times. That makes it very likely you’ll find the result you were looking for purely by chance.
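To put a number on “very likely”: assuming, just for illustration, that the 10 attempts are independent and that none of the desserts actually does anything, a one-line calculation gives the chance of at least one fluke.

```python
# Chance of at least one false-positive 'significant' result in 10 independent tries,
# when each try has a 10% chance of a fluke (the flip side of 90% confidence).
alpha, attempts = 0.10, 10
print(f"{1 - (1 - alpha) ** attempts:.0%}")  # about 65%
```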

The “90% confident” part is what people call the p-value, and the “hacking” is discarding the relevant data that doesn’t fit your theory.