Omitted Variables bias, comprehensive example and definitions



For example, suppose you want to estimate the value of an NBA basketball player. You propose a model with three factors: x = height, y = rebounding average per game, and z = scoring average per game. You fit it against all the players in the NBA and come up with a best-fit coefficient for each variable. Let’s presume you make no math mistakes.

Now someone asks you to do an omitted variable analysis in which you assess players according to z only. The resulting coefficient could match your full model’s coefficient for z pretty well, provided that scoring is uncorrelated with height and rebounding.

But if someone asks you to do the same analysis with x only, you get a very different coefficient. That’s because height and rebounding are not unrelated: it turns out that tall players are significantly better rebounders than short players. Because of this correlation, the x-only coefficient also absorbs part of the effect of rebounding, so it is biased and no longer matches the coefficient from the full model.
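The effect can be reproduced with a small simulation. This is a minimal sketch on synthetic data, not real NBA statistics: the coefficients `TRUE_X`, `TRUE_Y`, `TRUE_Z`, the 0.8 link between height and rebounding, and the helper `ols` are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical, standardized player data (not real NBA numbers).
height = rng.normal(size=n)
rebounds = 0.8 * height + 0.6 * rng.normal(size=n)  # correlated with height
scoring = rng.normal(size=n)                        # independent of both

# Assumed "true" value model with made-up coefficients.
TRUE_X, TRUE_Y, TRUE_Z = 1.0, 2.0, 3.0
value = (TRUE_X * height + TRUE_Y * rebounds + TRUE_Z * scoring
         + rng.normal(size=n))

def ols(target, *columns):
    """Least-squares slopes; an intercept is fitted and then dropped."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

full = ols(value, height, rebounds, scoring)  # all three variables
only_z = ols(value, scoring)[0]               # height and rebounding omitted
only_x = ols(value, height)[0]                # rebounding and scoring omitted

print("full model (x, y, z):", np.round(full, 2))
print("z only:", round(float(only_z), 2))
print("x only:", round(float(only_x), 2))
```

Under these assumptions the z-only slope stays close to 3.0, while the x-only slope lands near 1.0 + 2.0 × 0.8 = 2.6 instead of 1.0, because height picks up the credit for the rebounding it is correlated with.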

A standard example comes from medicine. Let’s say you’ve discovered a new vaccine against zombification in some apocalyptic future.

The old vaccine is not that great. Out of 1000 patients, only 500 were successfully protected and the other 500 turned into zombies, so a **50% success rate**.

You test your new vaccine, and... it looks worse: out of your 1000 new patients, only 400 were successfully protected and the other 600 turned into zombies, so a **40% success rate**.

Does that mean your vaccine is bad? Not necessarily: you might be missing a variable. In this fictional example, the missing variable is the age of the patient.

The first 1000 patients were 800 kids and 200 adults. 500 of the kids were protected while no adults were protected, so a **62.5% success rate for kids and 0% for adults.**

The second 1000 patients were 200 kids and 800 adults. All 200 kids were protected and 200 of the adults were protected too, so a **100% success rate for kids and 25% for adults.**

In other words, within each age group the new vaccine does better than the old one. Still not perfect, but better. The new one only seemed worse because you were missing a variable. And in fact, the new one might still turn out to be worse if you are missing yet another variable that turns the result on its head once again (like the sex of the patient).
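The arithmetic is easy to check directly from the counts given above. A short sketch (the names `trials`, `rate` and `overall` are just illustrative):

```python
# Counts taken from the example: (protected, total) per age group.
trials = {
    "old vaccine": {"kids": (500, 800), "adults": (0, 200)},
    "new vaccine": {"kids": (200, 200), "adults": (200, 800)},
}

def rate(protected, total):
    """Success rate within one group."""
    return protected / total

def overall(groups):
    """Success rate with the age groups pooled together."""
    protected = sum(p for p, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return rate(protected, total)

for name, groups in trials.items():
    per_group = ", ".join(f"{g}: {rate(*c):.1%}" for g, c in groups.items())
    print(f"{name}: overall {overall(groups):.1%} ({per_group})")

# old vaccine: overall 50.0% (kids: 62.5%, adults: 0.0%)
# new vaccine: overall 40.0% (kids: 100.0%, adults: 25.0%)
```

The new vaccine wins inside every age group but loses in the pooled totals, purely because its trial contained four times as many adults, who are harder to protect.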

When does something qualify as an “omitted variable bias”?

* A variable must be missing from the analysis (here, the age of the patients).
* This variable must be a **determinant of the result**: it influences the likelihood of success or failure (kids are more likely than adults to be protected by anti-zombie vaccines, so age is a determinant).
* This variable must be **correlated with a variable you did include** (here, which vaccine a patient received): the input data must not be balanced with respect to the missing variable (the first batch of patients is mostly kids, while the second is mostly adults).
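For a linear model, the last two conditions correspond to the two factors in the standard textbook formula for the bias. The notation is introduced here purely for illustration and is separate from the x, y, z of the basketball example: $Y$ is the outcome, $X$ the included variable, $Z$ the omitted one, and $\tilde{\beta}_1$ the slope you get when regressing $Y$ on $X$ alone.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X + \beta_2 Z + u
  && \text{(full model)} \\
Z &= \delta_0 + \delta_1 X + v
  && \text{(how the omitted variable moves with } X\text{)} \\
E[\tilde{\beta}_1] &= \beta_1 + \beta_2\,\delta_1
  && \text{(regression of } Y \text{ on } X \text{ alone)}
\end{aligned}
```

The bias term $\beta_2\,\delta_1$ vanishes if the omitted variable does not affect the result ($\beta_2 = 0$) or is uncorrelated with the included variable ($\delta_1 = 0$).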

Suppose you want to measure the effect of a college degree on wages.

You get data on wages and on whether each person has a college degree (from around 2000, before all the unemployed art majors).

You find a pretty strong correlation, with substantially higher wages for college graduates than for high school graduates.

But a person who has a college degree is also someone who could get into college in the first place. They must have had a few things going for them, and they might have done fairly well in life even without a college degree.

“Things going for them” is the omitted variable.
The extra wage those advantages would have earned them anyway, which the raw comparison credits to the degree, is the bias.

One way to fight this bias is to try to measure the omitted factor, e.g. through proxies such as parents’ income or parents’ education.
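A small simulation shows how much a proxy can help. This is a sketch on synthetic data with made-up numbers: the true degree premium `TRUE_PREMIUM`, the unobserved `background` variable standing in for "things going for them", and the noisy `parent_income` proxy are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Unobserved advantages ("things going for them"); hypothetical scale.
background = rng.normal(size=n)
# People with more advantages are more likely to hold a degree.
degree = (background + rng.normal(size=n) > 0).astype(float)
# Observable but noisy proxy for the unobserved advantages.
parent_income = background + 0.5 * rng.normal(size=n)

TRUE_PREMIUM = 5.0  # assumed true effect of a degree on wage
wage = (10 + TRUE_PREMIUM * degree + 4.0 * background
        + 2.0 * rng.normal(size=n))

def ols(target, *columns):
    """Least-squares slopes; an intercept is fitted and then dropped."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

naive = ols(wage, degree)[0]                      # background omitted
with_proxy = ols(wage, degree, parent_income)[0]  # proxy as a control
oracle = ols(wage, degree, background)[0]         # not possible in practice

print("naive estimate:     ", round(float(naive), 2))
print("with proxy control: ", round(float(with_proxy), 2))
print("with true background:", round(float(oracle), 2))
```

Under these assumptions the naive estimate comes out far above 5, adding the proxy pulls it much closer, and only controlling for the true background recovers roughly 5. A noisy proxy reduces the bias but does not eliminate it.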