The standard example is in medicine. Let's say you've discovered a new vaccine against zombification in some apocalyptic future.
The old vaccine is not that great. Out of 1000 patients, only 500 were successfully protected and the other 500 turned into zombies, so a **50% success rate**.
You test your new vaccine and... it looks worse: out of your 1000 new patients, only 400 were successfully protected and the other 600 turned into zombies, so a **40% success rate**.
Does that mean your vaccine is bad? Not necessarily; you might be missing a variable. In this fictional example, the missing variable is the age of the patient.
The first 1000 patients were 800 kids and 200 adults; 500 kids were protected while no adults were, so a **62.5% success rate for kids and 0% for adults.**
The second 1000 patients were 200 kids and 800 adults; all 200 kids were protected and 200 adults were protected too, so a **100% success rate for kids and 25% for adults.**
In other words, the new vaccine is strictly superior to the old one: it does better for kids and better for adults. Still not perfect, but better. It only seemed worse because you were missing a variable. And in fact, the new one might still turn out to be worse if you are missing yet another variable that turns the result on its head once again (like the sex of the patient).
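To check the arithmetic, here is a minimal Python sketch that recomputes the overall and per-age success rates from the patient counts given in the example. The names `trials`, `success_rate` and `overall_rate` are made up for this illustration.

```python
# (protected, total) per vaccine and age group, taken from the example's counts.
trials = {
    "old": {"kids": (500, 800), "adults": (0, 200)},
    "new": {"kids": (200, 200), "adults": (200, 800)},
}


def success_rate(protected, total):
    """Share of patients who were protected."""
    return protected / total


def overall_rate(groups):
    """Success rate when the age groups are lumped together."""
    protected = sum(p for p, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return protected / total


for vaccine, groups in trials.items():
    print(f"{vaccine} vaccine: {overall_rate(groups):.1%} overall")
    for age, (protected, total) in groups.items():
        print(f"  {age}: {success_rate(protected, total):.1%}")
```

Lumped together, the old vaccine wins (50% vs 40%); split by age, the new vaccine wins in both groups.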
When does something qualify as a “missed variable bias”?
* A variable must be missing from the problem (here the age of the patients)
* This variable must be a **determinant of the result**, in other words it must influence the likelihood of success/failure (kids are more likely than adults to be protected by anti-zombie vaccines, so age is a determinant)
* This variable must be **correlated with another variable**, in other words our input data must not be homogeneous with respect to the missing variable (here the first batch of patients is mostly kids while the second one has few kids).
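The last two conditions can be checked on the example's numbers with a rough Python sketch. The helper names (`rate`, `kid_share`) are hypothetical, and a real analysis would use a statistical test rather than a plain inequality, since real data is noisy.

```python
# (protected, total) per vaccine and age group, taken from the example's counts.
trials = {
    "old": {"kids": (500, 800), "adults": (0, 200)},
    "new": {"kids": (200, 200), "adults": (200, 800)},
}


def rate(protected, total):
    """Share of patients who were protected."""
    return protected / total


def kid_share(groups):
    """Fraction of a vaccine's patients who are kids."""
    total = sum(t for _, t in groups.values())
    return groups["kids"][1] / total


# Determinant of the result: within each vaccine, kids do better than adults.
age_matters = all(
    rate(*groups["kids"]) > rate(*groups["adults"])
    for groups in trials.values()
)

# Correlated with another variable: the age mix differs between the batches.
age_mix_differs = kid_share(trials["old"]) != kid_share(trials["new"])

print(f"kids in old batch: {kid_share(trials['old']):.0%}")
print(f"kids in new batch: {kid_share(trials['new']):.0%}")
print(f"age influences the outcome: {age_matters}")
print(f"age mix differs between batches: {age_mix_differs}")
```

Both checks come out true here, which is why ignoring age flips the comparison.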