I know some statistics, but I still have a hard time grasping what "controlling for a variable" means. To me, it means that you want to isolate the variance explained by a particular variable by controlling for variables that contribute confounding variance.
E.g., I want to predict ice-cream sales. As a predictor, I choose outside temperature. Let's say this explains 25% of the variance in ice cream sales. Now, let's say I want to control for what time of day it is. People might buy more ice cream around lunch than in the morning. This is confounding, since I only want to know how much variance outside temperature contributes. So, I control for time of day. Now, when I do this, the variance explained by temperature should decrease – right?
Or does "controlling for" simply mean including time of day as a predictor, just like outside temperature?
You control for the time of day by binning your data according to the time of day. You then check whether ice cream sales correlate with outside temperature within each time-of-day slot separately. You don't care whether time of day also predicts ice cream sales, because you already know that it (very likely) does, and it's not the question you want to explore.
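For instance, a minimal sketch in Python with pandas (the slot labels and numbers are invented):

```python
import pandas as pd

# Invented toy data: three time-of-day slots, three days
df = pd.DataFrame({
    "slot":  ["morning", "lunch", "afternoon"] * 3,
    "temp":  [18, 24, 27, 15, 22, 25, 20, 26, 29],
    "sales": [40, 95, 80, 30, 85, 70, 55, 110, 95],
})

# Within each bin, time of day is constant, so it cannot drive the
# temperature/sales correlation computed there
for slot, group in df.groupby("slot"):
    print(slot, group["temp"].corr(group["sales"]))
```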
Yes, it means including time of day as a predictor. When you compute the best-fit line with one predictor, the equation will look like this:
sales = x(temp) + C
with C being a constant, i.e., what your sales would be if temp were 0. But if you were to compute the best fit with time of day as well, the equation would look like this:
sales = x(temp) + z(hour) + C
Note: I'm only using "z" to represent the second coefficient because I'm on mobile and can't do subscripts. Normally you would denote all the coefficients as "x" with subscripts on them (x0, x1, x2, …). Note that with two predictors the fit is no longer a line in two dimensions but a plane in three.
When you compute the best fit this way, you can be more certain that the time of day is not influencing the coefficient for your temperature variable, because it's already factored into the model with its own coefficient. I hope this helps. I used some statistical jargon because you seem to already be introduced to single-variable analysis.
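As a rough numeric sketch of those two equations (all data invented), ordinary least squares recovers both sets of coefficients:

```python
import numpy as np

# Invented toy data
temp  = np.array([18, 24, 27, 15, 22, 25, 20, 26, 29], dtype=float)
hour  = np.array([ 9, 12, 15,  9, 12, 15,  9, 12, 15], dtype=float)
sales = np.array([40, 95, 80, 30, 85, 70, 55, 110, 95], dtype=float)

# One predictor: sales = x(temp) + C
X1 = np.column_stack([temp, np.ones_like(temp)])
(x_alone, C1), *_ = np.linalg.lstsq(X1, sales, rcond=None)

# Two predictors: sales = x(temp) + z(hour) + C
X2 = np.column_stack([temp, hour, np.ones_like(temp)])
(x, z, C2), *_ = np.linalg.lstsq(X2, sales, rcond=None)

print("temp coefficient, temp only:     ", x_alone)
print("temp coefficient, temp and hour: ", x)
```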
"Controlling for" means the variable is eliminated from the data by some other method before evaluating based on the criteria you want. For example, in medical studies they control for the placebo effect by randomizing the pool of candidates and making the study blind or double-blind.
Check out the wiki: https://en.m.wikipedia.org/wiki/Controlling_for_a_variable#:~:text=In%20controlled%20experiments%20of%20medical,such%20as%20the%20placebo%20effect.
In your specific example of ice cream sales, it would be hard to isolate temperature from time of day, since temperature fluctuates through the day. You may have to either work with daily average sales and daily average temperature, or narrow down to the same selling period so that time of day doesn't influence sales. For example, using 11am-1pm to capture your data for analysis and then comparing to other days at the same time period.
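Here is a sketch of both workarounds in Python with pandas (the timestamps and numbers are all made up):

```python
import numpy as np
import pandas as pd

# Invented hourly data: temperature follows a daily cycle,
# sales follow temperature
rng = np.random.default_rng(0)
ts = pd.date_range("2024-07-01 09:00", periods=72, freq="h")
temp = 20 + 8 * np.sin((ts.hour - 6) / 24 * 2 * np.pi) + rng.normal(0, 1, 72)
df = pd.DataFrame({"timestamp": ts, "temp": temp,
                   "sales": 3 * temp + rng.normal(0, 5, 72)})

# Option 1: collapse to daily averages
daily = df.set_index("timestamp").resample("D")[["temp", "sales"]].mean()

# Option 2: keep only the 11am-1pm window, holding time of day constant
lunch = df[df["timestamp"].dt.hour.between(11, 12)]  # 11:00 to 12:59

print(daily["temp"].corr(daily["sales"]))
print(lunch["temp"].corr(lunch["sales"]))
```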
Regarding your example, there are two points I'd like to emphasize.
Yes, you would control for a variable by "removing" it from the data set. For example, if time of day needed to be controlled, you would subset or partition your data so that only purchases *at* that time of day are included. So by removing it, I kind of mean making the training data *only* data that fits that variable.
Secondly, you should also check for intercorrelation between your variables. For example, if outside temperature is kept, note that time of day and outside temperature are correlated. So while time of day might imply things like you suggest (people getting ice cream during their lunch break, etc.), it could also just be the hottest time of day, so there is a relationship there.
This might get beyond your ELI5 prompt, but if you were doing predictive modeling you would check for significance between time of day and ice cream purchases AND significance between temperature and ice cream purchases. **Adding more variables will always appear to improve your model's fit, at the risk of overfitting,** but you might conclude that time of day can be excluded from your data set due to lack of significance. **BUT if you conclude there is an interaction effect between temperature and time of day,** it's best practice to keep both independent variables in your model, even if one on its own is determined to be insignificant. Does that make sense?
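As a rough sketch of those checks, assuming a statsmodels setup with made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with a genuine temp:hour interaction baked in
rng = np.random.default_rng(42)
df = pd.DataFrame({"temp": rng.uniform(15, 30, 200),
                   "hour": rng.integers(9, 21, 200).astype(float)})
df["sales"] = (2 * df["temp"] + 3 * df["hour"]
               + 0.2 * df["temp"] * df["hour"]
               + rng.normal(0, 5, 200))

# "temp * hour" expands to temp + hour + temp:hour in formula syntax
model = smf.ols("sales ~ temp * hour", data=df).fit()
print(model.summary())  # p-values on each term guide what to keep
```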
“Controlling for” a variable means doing something to make sure it doesn’t vary.
So, if you’re looking at ice cream sales varying with temperature, and you want to control for time of day, you could just make sure that your statistics are all at the same time of day. If you look at ice cream sales between 1 and 2 PM every day, you have successfully controlled for time of day: since it’s the same for all of them, any differences between the data must be due to some other factor.
In short, you want to make sure that everything except the thing you’re trying to measure is consistent.
There are statistical methods you can use to check this if you're not in a context where you can control how the data is collected. But generally the idea is to rule out that variable as a possible explanation.
With statistics, there are often 2 (or more) different ways of doing things: via study design on the front end, or via modeling method on the back end.
“Controlling for” a variable just means “we have taken steps to make sure this variable probably isn’t affecting things (much)”. There are multiple ways to do this.
Generally, the more variables you can “control for” with study design, the better, because this makes the data you gather cleaner. In this case, that would look something like only sampling data points at the same time of day, say 2pm each day. Then, when you’re examining sales vs temperatures, you can be reasonably confident that time of day isn’t confounding things.
Of course, this might make data collection a lot harder, so we can also use statistical methods to "control for" this effect in our model after the fact. Like you mentioned, this generally looks like including that variable as a predictor, so that its effect will be largely captured in its own coefficient, and then the coefficient of the temperature is less affected by time of day.
If you’ve “controlled for” a potentially confounding variable, it essentially means you have removed the variance in your data that you think is attributable to that variable. If you did this via study design, the explained variance % of your remaining variables should *increase*, because now hopefully more of the remaining variance is truly due to the effect you’re measuring. If you did it via stats, it will depend on the exact method you use (simple linear, mixed effects, etc.).
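A quick synthetic illustration of that last point, with made-up data where temperature tracks time of day:

```python
import numpy as np

# Invented data where temperature tracks time of day, and both
# genuinely affect sales
rng = np.random.default_rng(3)
hour = rng.integers(9, 21, 200).astype(float)
temp = 15 + 0.8 * hour + rng.normal(0, 2, 200)
sales = 2 * temp + 4 * hour + rng.normal(0, 5, 200)

def r_squared(X, y):
    """Fraction of variance explained by a least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var()

X_temp = np.column_stack([temp, np.ones_like(temp)])
X_both = np.column_stack([temp, hour, np.ones_like(temp)])
print("R^2, temp only:     ", r_squared(X_temp, sales))
print("R^2, temp and hour: ", r_squared(X_both, sales))
```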
**Short Version:**
– We statistically control for a variable when we were unable to control the input values of that control variable (CV) directly.
– The goal is to be able to adjust the output values of the dependent variable (DV) (and, in the case of confounding variables, also the input values of the independent variables (IVs)) to what they would have been if we *had* controlled the CV's input values.
– This is done by treating the CV like a predictor/causal variable while building our model – we need to do this because the variance in its input values is just as able to impact our output DV values as variance in our IV values.
– When we've completed our model, we can then remove the CV terms by substituting in their mean values, leaving us with an equation where the only unknowns are our IVs and DV (see the sketch after this list).
– We can’t remove the variance they contributed to the model though, so the final model will be less accurate than if we had controlled the input values.
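A minimal sketch of those steps in Python, with invented data and hypothetical variable names:

```python
import numpy as np

# Invented data: "cv" is a control variable whose inputs we
# could not hold constant
rng = np.random.default_rng(1)
iv = rng.normal(10, 2, 100)
cv = rng.normal(5, 1, 100)
dv = 2.0 * iv + 3.0 * cv + rng.normal(0, 1, 100)

# Treat the CV like any other predictor: dv = a*iv + b*cv + c
X = np.column_stack([iv, cv, np.ones_like(iv)])
(a, b, c), *_ = np.linalg.lstsq(X, dv, rcond=None)

# Remove the CV term by substituting its mean value, leaving an
# equation in the IV alone: dv ~ a*iv + (b*mean(cv) + c)
adjusted_intercept = b * cv.mean() + c
print(f"dv ~ {a:.2f}*iv + {adjusted_intercept:.2f}")
```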
___
You sound like you already know a fair bit about regression analysis, so sorry if this isn't as technical an answer as you're after, but I'll try to give a plain-English explanation:
Control variables (CVs) are variables that affect our output that we're not really interested in. Ideally we would control their input values to remain constant. This kind of controlling of inputs means we wouldn't have to account for them during modelling. They also don't contribute toward total model variance when controlled as inputs (because they didn't vary at all, haha).
If we're unable to control the input values of a CV, it becomes necessary to "control" for it in our output values for the dependent variable (DV). This means including the CV along with our independent variables (IVs) as a predictor (causal variable) during modelling – aka statistically controlling for the CV.
We need to model not only the relationship between the CV and DV, but also any relationships between it and the values of other causal variables (see example below). Once we’ve completed our model, we can remove them as variables from the equation by plugging in their mean values. They do, however, continue to contribute towards total model variance (because there was variance in our input values for them).
**Example Time:**
>*E.g., suppose we’re interested in how the number of firefighters at the site of a fire affects the time taken to extinguish it:*
>- IV: *Number of firefighters present*
>- DV: *Time taken to extinguish fire*
>- CVs: *All other predicting/causal variables*
Various CVs will affect the DV through a few different mechanisms and will appear in our final model in different forms:
– **Additively:** the most basic form – the CV's value determines a set percentage of total variance. (This is like your example of outside temp explaining 25% of total variance in ice-cream sales.)
> *eg* CV: Material being burned *- we would expect a stone or concrete building to be easier to put out than a wood one, which would be easier to put out than a factory that uses flammable petroleum products. To control for this variable we would separate our data into bins/categories and model each material separately.*
– **Confounding:** the CV influences both the DV and an IV. This can make it appear that the IV is influencing the DV much more than it actually is, since both values are being impacted by the same external factor. This can be controlled for by subtracting the impact of the CV from both variables before determining correlation (a toy simulation of this case is sketched after this list).
> *eg* CV: Size of fire *- we’d expect larger fires to need more firefighters, and we’d also expect them to take longer to put out. We need to control for size of fire by adjusting all of the IV and DV values to what they would be for an average sized fire. If we didn’t control this variable, our model would falsely correlate more firefighters with more time taken to extinguish fires.*
– **Interaction:** the CV's effect depends on the value of a second causal variable (or vice versa, or both), BUT neither variable is influenced by the value of the other. These can sometimes get super complex and icky to try to model accurately. E.g., the interaction might influence the value of a third causal variable rather than the DV, giving you some kind of gross confounding-interaction setup.
> *eg* CV: Water supply (volume/second) *interacting with* IV: no. of firefighters *– we would normally expect more firefighters to always cause a lower time taken to extinguish a fire, which would be true when there is plenty of water available.*
> *But consider bringing in more and more firefighters in a situation with a limited water supply (eg rural areas, or some newly developed residential areas running on old water infrastructure). The effectiveness of a fire hose is largely reliant on the very high pressure flow of water it ejects – once all of the available water supply is being used, we would expect the time taken to extinguish the fire to start to increase with an increasing number of firefighters. So to control for water supply we would need a variable adjustment factor, based on the ratio of water:firefighters, that accounts for the diminishing (eventually negative) correlation between the IV and DV when the ratio drops below a certain value.*
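A toy simulation of the confounding case above (all numbers invented), showing the sign of the firefighter coefficient flipping once fire size is included:

```python
import numpy as np

# Invented data: fire size drives both the number of firefighters
# dispatched and the time to extinguish, masking the fact that more
# firefighters genuinely reduce that time
rng = np.random.default_rng(2)
size = rng.uniform(1, 10, 500)                         # CV: size of fire
crew = 2 * size + rng.normal(0, 1, 500)                # IV: bigger fire, more crew
time = 5 * size - 1.0 * crew + rng.normal(0, 1, 500)   # DV: crew helps, size hurts

# Naive fit, time ~ crew: the coefficient comes out positive,
# falsely suggesting more firefighters mean slower extinguishing
X1 = np.column_stack([crew, np.ones_like(crew)])
naive, *_ = np.linalg.lstsq(X1, time, rcond=None)

# Controlled fit, time ~ crew + size: recovers the true negative effect
X2 = np.column_stack([crew, size, np.ones_like(crew)])
controlled, *_ = np.linalg.lstsq(X2, time, rcond=None)

print("naive crew coefficient:     ", naive[0])       # positive (confounded)
print("controlled crew coefficient:", controlled[0])  # close to -1.0
```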