Regression in stats


Please can someone explain regression in statistics to me, and how to interpret the data?

And if possible, how this works in R.

Thank you very much 🙂

In: Mathematics

2 Answers

Anonymous 0 Comments

Remember the slope-intercept formula for a line: y = mx + b

m is the slope: for each unit you move in the x direction, you move m units in the y direction.

b is the intercept: when x is 0, y is b.

Linear regression, the most common form of regression, uses linear algebra to calculate a line in this form from a group of points. The goal of the calculation is to minimize the distance of the line from the actual points. That is, it comes up with the m and b such that, when you plug in each real value of x, the average distance between the real value of y and the calculated value of y is as small as possible.
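Since the question also asked about R: the entire fit is one call to R's built-in lm() function. A minimal sketch, with made-up data just to have something to fit:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # made-up measurements

fit <- lm(y ~ x)   # "model y as a linear function of x"
coef(fit)          # prints b (the intercept) and m (the slope)
```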

Let's say I have 3 points: (1,1), (3,7), and (5,9).

Those points don't lie on one line, but the line y = 2x + 0 looks like it comes really close. If I plug 1, 3, and 5 in, I get 2, 6, and 10 out. Now, I can subtract each prediction from the real y value (1-2, 7-6, 9-10) to get -1, 1, and -1.

Now we have a problem: positives and negatives screw up my average, because I care about the _distance_, not the _direction_, of the estimate from the real value. So I can square the results, then average them. This case is simple: it comes out to (1+1+1)/3 = 1 for the average squared error.
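In R, that check is just vector arithmetic (a sketch using the same three points):

```r
x <- c(1, 3, 5)
y <- c(1, 7, 9)

predicted <- 2 * x + 0       # 2, 6, 10
residual  <- y - predicted   # -1, 1, -1
mean(residual^2)             # (1 + 1 + 1) / 3 = 1
```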

How does that compare to another estimate? What if I think the intercept is something else, like -1? y = 2x – 1

Again, plug the numbers in. I get 2-1=1, 6-1=5, and 10-1=9 for my estimates, and the differences are 0, 2, and 0. Square those and I end up with (0+4+0)/3 = 1 1/3 for the average squared error, which is bigger than 1, so it's a worse fit even though it hits two of the points exactly.
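Here's that same comparison in R, plus what lm() itself picks as the best line (it minimizes exactly this kind of average squared error):

```r
x <- c(1, 3, 5)
y <- c(1, 7, 9)

mean((y - (2 * x + 0))^2)   # y = 2x     -> 1
mean((y - (2 * x - 1))^2)   # y = 2x - 1 -> 1.33

fit <- lm(y ~ x)            # let R find the least-squares line
coef(fit)                   # intercept about -0.33, slope 2
mean(resid(fit)^2)          # about 0.89 -- lower than either guess
```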

Squaring the difference accentuates large differences. The difference between y = 2x and y = 2x – 1 is that the first stays close to all the points, while the second favors some of the points at the expense of others. Favoring some of the points is called "bias." Standard linear regression is unbiased, but other techniques can be biased. Sometimes there's a trade-off between bias and accuracy: in our example, the first was unbiased but didn't hit any of the points exactly, while the second was biased but got 2/3 of the points exactly right. Which is more important depends on what you're using the regression for.

The real world isn't simple: there are very few instances where only one piece of information determines an outcome. No worries, the math works the same for more X's! The general form of the linear regression model is:

Y = B0 + B1X1 + B2X2 + … + BnXn + e

What this actually is, is the equation for a line in an n-dimensional space. (Technically n+1 dimensions, since y is also a dimension.) The B's are actually the Greek letter beta, and the numbers are subscripts denoting which dimension or variable each one belongs to. (I'm on my phone, sorry.) B0 is the intercept moved to the front, and e is the "error term," which we'll get to in a moment.

You might have seen the formula for a line in 3 dimensions written as y = 4X + 8Z + 7, for example. In a linear regression you would see it as y = 7 + 4X1 + 8X2 + e. They both mean the same thing: for every unit we move in the X1 direction, we move 4 units in the y direction; for every unit we move in the X2 direction, we move 8 units in the y direction; and if all the X's are 0, y is 7.

Then there's that pesky "e." What it means is that we live in the real world, where things are a little fuzzy. It could be that there are some unaccounted-for X's that affect y, so we move in those directions while we're moving in the known directions, but the model doesn't capture that movement. Or it could be that there's some genuine randomness going on, so every time we move one unit in the X1 direction, we move B1 plus or minus a little in the y direction.
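To tie this back to the R part of the question: here's a sketch that simulates data from the example model y = 7 + 4X1 + 8X2 + e (all the numbers below are made up for illustration) and lets lm() recover the betas:

```r
set.seed(1)              # make the random example reproducible
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
e  <- rnorm(n)           # the error term: random fuzz
y  <- 7 + 4 * x1 + 8 * x2 + e

fit <- lm(y ~ x1 + x2)   # B0, the intercept, is included automatically
coef(fit)                # estimates should land close to 7, 4, and 8
```

summary(fit) adds standard errors, t-statistics, and R-squared to that output, which is where the "interpreting the data" part of the original question starts.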

That’s the intuition behind a linear regression: It’s the equation for a line. A change in each of the X’s results in the corresponding B’s worth of change in y.

The math details aren't too complicated, but I'm on my phone, it's a huge post already, and it's ELI18 or so. The same goes for R: beyond the basic lm() sketches above, I'm not that familiar with the language.

edited because autocorrect