statistically control for a variable

I know some statistics but I still have a hard time grasping what “controlling for a variable” means. For me, it means that you want to isolate the variance explained by a particular variable by controlling for variables that contribute confounding variance.

E.g., I want to predict ice-cream sales. As a predictor, I choose outside temperature. Let’s say that this explains 25% of the variance in ice-cream sales. Now, let’s say that I want to control for what time of day it is. People might buy more ice cream around lunch than in the morning. This is confounding, since I only want to know how much variance outside temperature contributes. So, I control for time of day. Now, when I do this, the variance explained by temperature should decrease, right?

Or, does “controlling for” simply mean including time of day as a predictor, just like outside temperature?

7 Answers

Anonymous 0 Comments

**Short Version:**

– We statistically control for a variable when we were unable to control the input values of a control variable (CV).
– The goal is to adjust the output values of the dependent variable (DV) (and, in the case of confounding variables, also the input values of the independent variables (IVs)) to what they would have been if we *had* controlled the CV’s input values.
– This is done by treating the CV like a predictor/causal variable while building our model. We need to do this because variance in its input values is just as able to affect our DV values as variance in our IV values.
– When we’ve completed our model, we can then remove the CV terms by substituting in their mean values, leaving us with an equation where the only unknowns are our IVs and DV.
– We can’t remove the variance they contributed to the model though, so the final model will be less accurate than if we had controlled the input values.

___

You sound like you already know a fair bit about regression analysis, so sorry if this isn’t as technical an answer as you’re after, but I’ll try to give a plain-English explanation:

Control variables (CVs) are variables that affect our output that we’re not really interested in. Ideally we would control their input values to remain constant. This kind of controlling of inputs would mean we wouldn’t have to account for them during modelling. They also don’t contribute toward total model variance when controlled as inputs (because they didn’t vary at all haha).

If we’re unable to control the input values of a CV, it becomes necessary to “control” for it in our output values for the dependent variable (DV). This means including the CV along with our independent variables (IVs) as a predictor (causal variable) during modelling, aka statistically controlling for the CV.

We need to model not only the relationship between the CV and DV, but also any relationships between it and the values of other causal variables (see example below). Once we’ve completed our model, we can remove them as variables from the equation by plugging in their mean values. They do, however, continue to contribute towards total model variance (because there was variance in our input values for them).
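
As a rough sketch of the “include it as a predictor, then plug in its mean” idea, here is a small simulation built on the ice-cream example from the question. Everything in it is made up for illustration: the variable names (`temperature`, `hour`, `sales`), the coefficients and the noise level are assumptions, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: temperature is the IV, hour of day is the CV,
# ice-cream sales is the DV. All numbers are invented.
temperature = rng.uniform(10, 35, n)
hour = rng.uniform(8, 20, n)
sales = 20 + 3.0 * temperature + 1.5 * hour + rng.normal(0, 5, n)

# Statistically control for the CV by including it as a predictor.
X = np.column_stack([np.ones(n), temperature, hour])
b0, b_temp, b_hour = np.linalg.lstsq(X, sales, rcond=None)[0]

# Remove the CV term from the equation by plugging in its mean value.
adjusted_intercept = b0 + b_hour * hour.mean()

def predict_at_mean_hour(temp):
    """Predicted sales for a given temperature at the average hour."""
    return adjusted_intercept + b_temp * temp

print("temperature slope:", b_temp)
print("prediction at 20 degrees:", predict_at_mean_hour(20.0))
```

The fitted temperature slope should land close to the 3.0 used to generate the data, and the reduced equation only has temperature and sales left as unknowns.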

**Example Time:**

>*E.g., suppose we’re interested in how the number of firefighters at the site of a fire affects the time taken to extinguish it:*

>- IV: *Number of firefighters present*
>- DV: *Time taken to extinguish fire*
>- CVs: *All other predicting/causal variables*

Various CVs will affect the DV through a few different mechanisms and will appear in our final model in different forms:

– **Additively:** the most basic form. The CV’s value accounts for a set percentage of the total variance. (This is like your example of outside temperature explaining 25% of the total variance in ice-cream sales.)

> *eg* CV: Material being burned *- we would expect a stone or concrete building to be easier to put out than a wood one, which would be easier to put out than a factory that uses flammable petroleum products. To control for this variable we would separate our data into bins/categories and model each material separately.*
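
A minimal sketch of the “bin by material and model each bin separately” approach, assuming simulated data. The baseline times, the true slope of -2 minutes per firefighter and the noise level are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up baseline extinguishing times (minutes) per material.
baseline_minutes = {"stone": 40.0, "wood": 70.0, "petroleum": 120.0}
n_per_bin = 200

slopes = {}
for material, base in baseline_minutes.items():
    # Simulate fires of this material only.
    firefighters = rng.integers(2, 21, n_per_bin)  # 2 to 20 firefighters
    minutes = base - 2.0 * firefighters + rng.normal(0, 4, n_per_bin)

    # Fit a separate straight line within the bin.
    slope, intercept = np.polyfit(firefighters, minutes, 1)
    slopes[material] = slope
    print(material, "slope:", slope, "intercept:", intercept)
```

Within each bin the material is constant, so it can no longer add variance to the relationship between firefighters and time. The intercepts differ between bins while the slopes should all come out near -2.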

– **Confounding:** the CV influences both the DV and an IV. This can make it appear like the IV is influencing the DV much more than it actually is, since both values are being impacted by the same external factor. This can be controlled for by subtracting the impact of the CV from both variables before determining correlation.

> *eg* CV: Size of fire *- we’d expect larger fires to need more firefighters, and we’d also expect them to take longer to put out. We need to control for size of fire by adjusting all of the IV and DV values to what they would be for an average-sized fire. If we didn’t control this variable, our model would misleadingly associate more firefighters with more time taken to extinguish fires.*
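
To make the “subtract the impact of the CV from both variables” step concrete, here is a sketch on simulated data. The numbers are assumptions chosen so that each extra firefighter truly saves one minute, while bigger fires both attract more firefighters and take longer. The helper `residualize` is a hypothetical name introduced just for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Invented data: fire size drives both the IV and the DV.
fire_size = rng.uniform(0, 100, n)
firefighters = 2 + 0.5 * fire_size + rng.normal(0, 2, n)
minutes = 10 + 3.0 * fire_size - 1.0 * firefighters + rng.normal(0, 5, n)

def residualize(values, control):
    """Remove the part of `values` that a straight line in `control` explains."""
    slope, intercept = np.polyfit(control, values, 1)
    return values - (intercept + slope * control)

# Ignoring fire size: more firefighters looks like more time.
naive_slope = np.polyfit(firefighters, minutes, 1)[0]

# Controlling for fire size: strip its effect from both variables first.
ff_resid = residualize(firefighters, fire_size)
min_resid = residualize(minutes, fire_size)
adjusted_slope = np.polyfit(ff_resid, min_resid, 1)[0]
partial_corr = np.corrcoef(ff_resid, min_resid)[0, 1]

print("naive slope:", naive_slope)
print("adjusted slope:", adjusted_slope)
print("partial correlation:", partial_corr)
```

The naive slope comes out positive, while the slope after removing fire size from both variables is negative and close to the -1 built into the simulation. For a linear model, regressing the residuals on each other gives the same firefighter coefficient as including fire size as a second predictor.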

– **Interaction:** the CV’s effect depends on the value of a second causal variable (or vice versa, or both), BUT neither variable is influenced by the value of the other. These can sometimes get super complex and icky to try to model accurately. E.g., the interaction might influence the value of a third causal variable rather than the DV, giving you some kind of gross confounding interaction setup.

> *eg* CV: Water supply (volume/second) interacting with IV: no. of firefighters *- we would normally expect more firefighters to always cause a lower time taken to extinguish a fire, which would be true when there is plenty of water available.*
> *But consider bringing in more and more firefighters in a situation with a limited water supply (eg rural areas, or some newly developed residential areas running on old water infrastructure). The effectiveness of a fire hose is largely reliant on the very high-pressure flow of water it ejects. Once all of the available water supply is being used, we would expect the time taken to extinguish the fire to start to increase with an increasing number of firefighters. So to control for water supply we would need a variable adjustment factor, based on the ratio of water to firefighters, that accounts for the diminishing (eventually negative) correlation between the IV and DV when the ratio drops below a certain value.*
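
One common way to model this kind of dependence is a product (interaction) term in the regression. The sketch below is a simplification: it assumes the firefighter effect changes linearly with water supply rather than switching at a threshold ratio, and all of the data and coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Invented data: water supply in litres/second, firefighter head count.
water = rng.uniform(10, 60, n)
firefighters = rng.integers(2, 21, n).astype(float)

# Assumed truth: each firefighter changes the time by (1.0 - 0.05 * water)
# minutes, so extra firefighters hurt when water is scarce and help when
# it is plentiful.
minutes = (80 + (1.0 - 0.05 * water) * firefighters
           - 0.3 * water + rng.normal(0, 3, n))

# Regression with main effects plus a firefighters x water product term.
X = np.column_stack([np.ones(n), firefighters, water, firefighters * water])
b0, b_ff, b_water, b_ff_water = np.linalg.lstsq(X, minutes, rcond=None)[0]

def firefighter_slope(water_supply):
    """Estimated change in minutes per extra firefighter at a given water supply."""
    return b_ff + b_ff_water * water_supply

print("slope at 10 L/s:", firefighter_slope(10.0))
print("slope at 60 L/s:", firefighter_slope(60.0))
```

In this setup the product-term coefficient plays the role of the adjustment factor: the estimated effect of one more firefighter is positive at a low water supply and negative at a high one.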
