I know some statistics, but I still have a hard time grasping what “controlling for a variable” means. To me, it means isolating the variance explained by a particular variable by controlling for other variables that contribute confounding variance.

E.g., I want to predict ice cream sales. As a predictor, I choose outside temperature. Let’s say that this explains 25% of the variance in ice cream sales. Now, let’s say that I want to control for what time of day it is. People might buy more ice cream around lunch than in the morning. This is confounding, since I only want to know how much variance outside temperature contributes. So, I control for time of day. Now, when I do this, the variance explained by temperature should decrease – right?

Or does “controlling for” simply mean including time of day as a predictor, just like outside temperature?


Yes, it means including time of day as a predictor. When you compute the best fit line, with one predictor the equation will look like this:

sales = x(temp) + C

with C being a constant, i.e. what your sales would be if temp were 0. But if you were to compute the best fit with time of day as well, the equation would look like this:

sales = x(temp) + z(hour) + C

Note: I’m only using “z” to represent the second coefficient because I’m on mobile and can’t do subscripts. Normally you would denote all the coefficients as “x” with subscripts on them (x0, x1, x2…). Also note that with two predictors the fit is no longer a line on a 2-D graph; it is a plane in three dimensions.

When you compute the best fit this way, you can be more certain that the time of day is not influencing the coefficient for your temperature variable, because it’s already factored into the model with its own coefficient. I hope this helps. I used some statistical jargon because you seem to already be familiar with single-variable analysis.
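To make the two equations above concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the numbers are invented, `ols` is a small hypothetical helper (not a library function), and the simulation builds in a true temperature effect of 2.0 plus an hour effect that also pushes temperature up.

```python
import random

def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(0)
n = 2000
# Made-up world: later hours are warmer, and later hours also sell more.
hour = [random.uniform(8, 16) for _ in range(n)]
temp = [15 + (h - 8) + random.gauss(0, 2) for h in hour]
sales = [2.0 * t + 5.0 * h + random.gauss(0, 5) for t, h in zip(temp, hour)]

# sales = x(temp) + C
_, slope_alone = ols([temp], sales)
# sales = x(temp) + z(hour) + C
_, slope_controlled, slope_hour = ols([temp, hour], sales)

print("temp coefficient, temp only:       ", round(slope_alone, 2))
print("temp coefficient, hour controlled: ", round(slope_controlled, 2))
```

In this setup the temperature-only coefficient should come out well above 2.0, because temperature is partly standing in for time of day. Once hour gets its own coefficient, the temperature coefficient should land close to the 2.0 that was built into the simulation.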

“Controlling for” a variable means its influence is eliminated from the data by some other method before you evaluate the relationship you actually care about. For example, medical studies control for the placebo effect by randomizing the pool of candidates and making the study blind or double-blind.

Check out the wiki: https://en.m.wikipedia.org/wiki/Controlling_for_a_variable#:~:text=In%20controlled%20experiments%20of%20medical,such%20as%20the%20placebo%20effect.

In your specific example of ice cream sales, it would be hard to isolate temperature from time of day, since temperature fluctuates through the day. You may have to either work with daily average sales and daily average temperature, or restrict the data to the same selling period so that time of day doesn’t influence sales. For example, you could use 11am-1pm to capture your data for analysis and then compare it to other days at the same time period.

There are two points in your example I’d like to emphasize.

First, yes, you would control for a variable by “removing” it from the data set. For example, if time of day needed to be controlled, you would subset or partition your data so that only purchases *at* a given time of day are included. By “removing” it, I mean restricting the training data to *only* the data that matches that value of the variable.

Secondly, you should also check for correlation between your predictors. For example, if outside temperature is kept, note that time of day and outside temperature are correlated. So while time of day might matter for the reason you suggest (people getting ice cream during their lunch break, etc.), it could also be that this is just the hottest time of day as well, so there is a relationship there.

This might get beyond your ELI5 prompt, but if you were doing predictive modeling you would check for significance between time of day and ice cream purchases AND significance between temperature and ice cream purchases. **Adding more variables will never make the fit on your training data worse, and usually makes it look better, simply due to overfitting,** so you might still conclude that time of day can be excluded from your model due to lack of significance. **BUT if you conclude there is an interaction effect between temperature and time of day,** it’s best practice to keep both independent variables in your model, even if one on its own is determined to be insignificant. Does that make sense?
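A tiny sketch of the overfitting point, using invented data: the outcome and the extra predictor are generated independently, so the predictor is junk by construction, yet the in-sample R² of the model that includes it cannot be lower than that of the model without it. `r_squared` is a hypothetical helper written for this example, and the baseline model is assumed to be intercept-only (R² of 0).

```python
import random

def r_squared(x, y):
    """In-sample R^2 of the simple regression y ~ x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

random.seed(1)
n = 30
sales = [random.gauss(100, 10) for _ in range(n)]  # made-up outcome
junk = [random.gauss(0, 1) for _ in range(n)]      # generated independently of sales

r2_without = 0.0                  # intercept-only model: predicts the mean, explains nothing
r2_with = r_squared(junk, sales)  # same model plus the junk predictor

print("R^2 without junk predictor:", r2_without)
print("R^2 with junk predictor:   ", round(r2_with, 4))
```

The apparent gain here is noise fitted on the training sample, which is why a significance check (or a held-out test set) is worth doing before keeping a variable.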

“Controlling for” a variable means doing something to make sure it doesn’t vary.

So, if you’re looking at ice cream sales varying with temperature, and you want to control for time of day, you could just make sure that your statistics are all at the same time of day. If you look at ice cream sales between 1 and 2 PM every day, you have successfully controlled for time of day: since it’s the same for all of them, any differences between the data must be due to some other factor.

In short, you want to make sure that everything except the thing you’re trying to measure is consistent.

There are statistical methods you can use to do this when you’re not in a context where you can control how the data is collected. But generally the idea is to rule out that variable as a possible explanation.

You control for the time of day by binning your data according to the time of day. So you then decide whether ice cream sales correlate with outside temperature only within each time of day slot separately. You don’t care whether time of day also predicts ice cream sales, because you already know that it (very likely) does, and it’s not the question that you want to explore.
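Here is a rough sketch of that binning approach on simulated data. The slot names, typical temperatures, and baseline sales are all invented for illustration, `slope` is a hypothetical helper, and the simulation assumes the same true temperature effect (2.0) inside every slot.

```python
import random
from collections import defaultdict

def slope(x, y):
    """Slope of the simple least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(2)
# Made-up slots: (typical temperature, baseline sales unrelated to temperature).
slots = {"morning": (16.0, 20.0), "lunch": (22.0, 60.0), "afternoon": (25.0, 40.0)}

rows = []
for slot, (typical_temp, baseline) in slots.items():
    for _ in range(500):
        temp = random.gauss(typical_temp, 3.0)
        sales = baseline + 2.0 * temp + random.gauss(0, 5.0)
        rows.append((slot, temp, sales))

# Ignoring time of day: one slope over all the data.
pooled = slope([t for _, t, _ in rows], [s for _, _, s in rows])

# Controlling for time of day: bin by slot, then fit within each bin.
bins = defaultdict(lambda: ([], []))
for slot, temp, sales in rows:
    bins[slot][0].append(temp)
    bins[slot][1].append(sales)
within = {slot: slope(temps, sold) for slot, (temps, sold) in bins.items()}

print("pooled slope:", round(pooled, 2))
for slot, b in within.items():
    print(slot, "slope:", round(b, 2))
```

Within each slot the slope should sit near the built-in 2.0, while the pooled slope should be noticeably larger because the warmer slots also have higher baseline sales.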