I know some statistics, but I still have a hard time grasping what “controlling for a variable” means. As I understand it, it means isolating the variance explained by one particular variable by accounting for other variables that contribute confounding variance.
E.g., I want to predict ice cream sales. As a predictor, I choose outside temperature. Let’s say that this explains 25% of the variance in ice cream sales. Now, let’s say that I want to control for what time of day it is. People might buy more ice cream around lunch than in the morning. This is confounding, since I only want to know how much variance outside temperature contributes. So, I control for time of day. Now, when I do this, the variance explained by temperature should decrease, right?
Or does “controlling for” simply mean including time of day as a predictor, just like outside temperature?
In your example, you have two points I’d like to emphasize.
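Yes, one way to control for a variable is to hold it fixed in the data set. For example, if time of day needed to be controlled, you would subset or partition your data so that only purchases *at* a given time of day are included. In other words, the training data contains *only* observations that share the same value of that variable. In a regression model, the more common approach is the one you describe: include time of day as a predictor alongside temperature, so that the temperature coefficient is estimated with time of day held constant.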
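To make the subsetting idea concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the simulated numbers, the two-level `lunch` variable standing in for time of day, and the "true" temperature effect of 2 sales per degree. It compares the temperature slope on all the data with the slope inside each time-of-day subset.

```python
import numpy as np

# Simulated data (all values are assumptions for illustration only).
rng = np.random.default_rng(0)
n = 2000
lunch = rng.integers(0, 2, n)                 # 1 = lunchtime, 0 = morning
temp = 20 + 5 * lunch + rng.normal(0, 2, n)   # it is warmer at lunchtime
sales = 10 + 2 * temp + 15 * lunch + rng.normal(0, 5, n)

def slope(x, y):
    """Slope of the simple least-squares line y = a + b*x."""
    return np.polyfit(x, y, 1)[0]

# Ignoring time of day mixes its effect into the temperature slope.
slope_all = slope(temp, sales)

# Holding time of day fixed: fit within each subset separately.
slope_lunch = slope(temp[lunch == 1], sales[lunch == 1])
slope_morning = slope(temp[lunch == 0], sales[lunch == 0])

print(f"all data:     {slope_all:.2f}")
print(f"lunch only:   {slope_lunch:.2f}")
print(f"morning only: {slope_morning:.2f}")
```

In this simulation the slope on all the data comes out well above 2, while the within-subset slopes land close to the value that was built in, because inside a subset time of day can no longer vary along with temperature.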
Secondly, you should also check for correlation between your predictors. For example, time of day and outside temperature are correlated. So while time of day might capture the things you suggest, such as people getting ice cream during their lunch break, it could also simply be the hottest time of day, so the two variables are related.
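Here is a rough sketch of checking that correlation and then adjusting for it, again on simulated data where every number is an assumption. The `ols` helper is a hypothetical convenience function written for this example, not part of any library. It fits sales on temperature alone, then on temperature plus a lunchtime indicator.

```python
import numpy as np

# Simulated data (all values are assumptions for illustration only).
rng = np.random.default_rng(1)
n = 2000
lunch = rng.integers(0, 2, n).astype(float)   # 1 = lunchtime, 0 = morning
temp = 20 + 5 * lunch + rng.normal(0, 2, n)
sales = 10 + 2 * temp + 15 * lunch + rng.normal(0, 5, n)

# Step 1: how strongly are the two predictors related?
r = np.corrcoef(temp, lunch)[0, 1]

def ols(columns, y):
    """Least-squares coefficients; element 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Step 2: temperature alone vs. temperature with time of day included.
b_naive = ols([temp], sales)[1]
b_adj, b_lunch = ols([temp, lunch], sales)[1:]

print(f"corr(temp, lunch) = {r:.2f}")
print(f"temperature slope, unadjusted: {b_naive:.2f}")
print(f"temperature slope, adjusted:   {b_adj:.2f}")
print(f"lunch effect:                  {b_lunch:.2f}")
```

Because the predictors are correlated, the unadjusted temperature slope absorbs part of the lunchtime effect. Once both are in the model, each coefficient describes its own variable with the other held constant.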
This might go beyond your ELI5 prompt, but if you were doing predictive modeling you would test for a significant relationship between time of day and ice cream purchases AND between temperature and ice cream purchases. **Adding more variables never makes the fit to your training data worse, and usually improves it, but that gain can reflect overfitting rather than a better model,** so you might conclude that time of day can be excluded from your model due to lack of significance. **BUT if you conclude there is an interaction effect between temperature and time of day,** it’s best practice to keep both independent variables in your model, even if one on its own is determined to be insignificant. Does that make sense?
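A small sketch of the overfitting point, with simulated data and a hypothetical `r_squared` helper (both are assumptions made for this example). It shows that in-sample R² does not go down when a predictor is added, even a pure-noise column, and how an interaction term sits alongside both main effects.

```python
import numpy as np

# Simulated data (all values are assumptions for illustration only).
rng = np.random.default_rng(2)
n = 200
lunch = rng.integers(0, 2, n).astype(float)
temp = 20 + 5 * lunch + rng.normal(0, 2, n)
noise = rng.normal(size=n)                    # unrelated to sales by construction
sales = 10 + 2 * temp + 15 * lunch + rng.normal(0, 5, n)

def r_squared(columns, y):
    """In-sample R^2 of a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

r2_temp = r_squared([temp], sales)
r2_both = r_squared([temp, lunch], sales)
r2_noise = r_squared([temp, lunch, noise], sales)

# Interaction model: keep both main effects next to the product term.
r2_inter = r_squared([temp, lunch, temp * lunch], sales)

for name, value in [("temp", r2_temp), ("temp + lunch", r2_both),
                    ("temp + lunch + noise", r2_noise),
                    ("temp + lunch + temp*lunch", r2_inter)]:
    print(f"{name:28s} R^2 = {value:.4f}")
```

The noise column nudges R² up slightly even though it has nothing to do with sales, which is why in-sample fit alone is a poor reason to keep a variable. Checking significance or scoring the model on held-out data is a fairer test.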