So you have a bunch of data, and you would like to put a line through it. The problem is that your data doesn’t sit on a line. So instead you just pick a line that gets as close to all your data points as possible. This is a linear regression.
Some of your points won’t be on that line. The difference between a data point and the line is called the error. When you have a computer do this it’s going to select a line that minimizes the overall error.
Depending on your data this might work really well and it might not. The R-squared value is just a way of measuring how well the line fits the data. If you have an R-squared value close to one, then you have a really good line. If you have an R-squared value close to zero, then the line explains almost none of the variation in your data, and a straight line isn’t a good fit.
In short: it’s the percentage of variation in the outcome variable that is explained by your predictor variable.
It’s a good measure of how much influence your model’s predictor variable has on your outcome variable. When you take 1 - R², that gives you the unexplained variance, which comes from additional factors outside your model that also affect the outcome variable.
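If it helps to see this as code, here’s a minimal Python/numpy sketch (the numbers are made up purely for illustration): it fits a least-squares line, computes R-squared, and prints 1 - R² as the unexplained share.

```python
import numpy as np

# Made-up example data: a predictor (x) and an outcome (y)
x = np.array([4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([2.1, 2.4, 3.0, 3.1, 3.8, 3.9])

# Least-squares line: slope and intercept that minimize the overall squared error
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept

# R-squared: fraction of the variation in y that the line accounts for
ss_res = np.sum((y - predictions) ** 2)   # error left over after the fit
ss_tot = np.sum((y - np.mean(y)) ** 2)    # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
print(f"R-squared: {r_squared:.2f} (unexplained share: {1 - r_squared:.2f})")
```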
Regression is about finding the ‘line of best fit’ for some data. R-squared is a way of measuring whether the line of best fit describes the data well.
Imagine your data points on a graph, with a straight line drawn at random through the points. Then imagine you fastened elastic bands between each of the points and the line. They would pull the line into the ‘line of best fit’. This is essentially regression: as the elastic bands pull and get shorter, the line moves and twists, and on average gets closer to all the points. The length of the elastic bands is also called the error.
R-squared is a measure of how close the line is to the points, or how low the overall error is, or how short your elastic bands are. If the points are all in a straight line, then the length of your elastic is small. If the points are not correlated, if they’re all over the place, then the elastic is going to be stretched and taut, meaning a high error and a low R-squared.
You can put certain information on a grid, like houses on city blocks. So imagine you take a map and put a dot where each of your friends’ houses is. That’s like putting points on a graph. Now imagine you want to run a string-can telephone to all of your friends’ houses and you want the main string to go in a straight line. You could take a ruler and draw straight lines on the map and then measure how close each line gets to each of the dots you made. There’s a “best” line that gets closest to all of the dots on average. You can use math to find out what that line is, basically by taking the distance from each dot to the line and multiplying it by itself (so you only deal with positive numbers, which makes things easier) and then finding the line with the lowest total of those squared distances.
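If you’re curious what that “draw lots of lines with a ruler and measure” idea looks like as code, here’s a rough Python sketch (the house coordinates are invented for the example): it tries a grid of candidate lines and keeps the one with the smallest total of squared distances.

```python
import numpy as np

# Invented map coordinates for the friends' houses (x = blocks east, y = blocks north)
houses_x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
houses_y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

best_line = None
best_total = float("inf")

# "Draw" lots of candidate straight lines and measure each one
for slope in np.linspace(-2, 2, 81):
    for intercept in np.linspace(-3, 3, 121):
        line_y = slope * houses_x + intercept
        # distance from each dot to the line, multiplied by itself, then added up
        total = np.sum((houses_y - line_y) ** 2)
        if total < best_total:
            best_total = total
            best_line = (slope, intercept)

print(f"best line found: y = {best_line[0]:.2f}x + {best_line[1]:.2f} "
      f"(total squared distance {best_total:.2f})")
```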
Essentially you have to take two measurements, one of an independent variable and one of a dependent variable. For example, sleep and GPA. Researchers plot the results they get on a graph. From this plot, the researchers use complex formulas to create a line that minimizes the *residuals*, or the differences between the plotted data points and the line that they draw. They then use other complex formulas to determine how well the data fits the line; the closer the data is clustered around the line, the better the fit. This is known as the *correlation coefficient*, or r.
r is good, but its sign always matches the slope of the line (if y decreases as x increases, the slope is negative, and so is r). Researchers want a measure that doesn’t depend on that sign, so they square r to get r-squared, the *coefficient of determination*. If your r-squared value is .64, you’d say that 64% of the variation in y is explained by the variation in x.
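As a small illustration of that last step (the sleep/GPA numbers below are made up), here’s how you could compute r and r-squared in Python:

```python
import numpy as np

# Made-up sleep (hours) vs GPA data
sleep = np.array([5, 6, 6, 7, 7, 8, 9], dtype=float)
gpa = np.array([2.6, 2.9, 3.1, 3.2, 3.4, 3.6, 3.7])

r = np.corrcoef(sleep, gpa)[0, 1]   # correlation coefficient (can be negative)
r_squared = r ** 2                  # coefficient of determination (always between 0 and 1)

print(f"r = {r:.2f}, r-squared = {r_squared:.2f}")
print(f"about {100 * r_squared:.0f}% of the variation in GPA is explained by sleep")
```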
Others have explained that a regression is just making a line go through some data points as best it can. Others talked about the R^2 term and what it means, but I feel they either don’t explain why or aren’t ELI5-level, so here’s my attempt.
When you draw a line through the data points, you need to have a way of deciding how “good” it is. What’s the best way to do that?
If you imagine it’s a 2-d plot, with an x and a y axis, then one thing you can do is measure the distance between each point and the line. You do that by subtracting the value of the line at the given point’s x coordinate from the point’s y coordinate. So if you had a point at (1, 1.5) and the line went through (1, 1), then the distance is 0.5. We don’t really care if the line is above or below the point, so we take the square of that distance so that negative distances and positive distances are treated the same. It also has the benefit of giving larger distances a bigger impact. That’s called the “squared residual” for that point. If you add those up for _all_ the points, you get something called the “residual sum of squares.”
That is a pretty good way to determine how well a line fits, and you can actually use it directly to find the best line – you can just keep changing the line until the residual sum of squares is as low as it will go.
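As a rough sketch of that idea in Python (the points and the candidate lines are arbitrary examples), the residual sum of squares takes only a few lines of code, and you can see it drop as the line gets better:

```python
import numpy as np

points_x = np.array([1.0, 2.0, 3.0, 4.0])
points_y = np.array([1.5, 1.9, 3.2, 3.9])

def residual_sum_of_squares(slope, intercept):
    """Add up the squared vertical distances between each point and the line."""
    line_y = slope * points_x + intercept   # value of the line at each point's x
    residuals = points_y - line_y           # point minus line (can be negative)
    return np.sum(residuals ** 2)           # squaring makes every term positive

# A rough guess vs. a better guess: the better line has a smaller RSS
print(residual_sum_of_squares(slope=0.5, intercept=1.0))
print(residual_sum_of_squares(slope=0.85, intercept=0.5))
```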
However, depending on your dataset and how many points there are, that number can be anywhere from zero to arbitrarily large, and it keeps growing as you add more points. That makes it hard to compare linear regressions on different datasets.
So, we look at another measure as well. We take the _average_ y-value of all the points, called the _mean_, and we measure the distance of each point from that mean, using the same squaring trick to make sure it’s always positive. If we add all of these up, now we have a measure of the _variance_ of the data – we know how much it is spread out in general. We call that the total sum of squares.
If you take 1 – SS_res / SS_tot, you get the R^2 coefficient. This number is bounded between 0 and 1, where if it’s zero, it means the line is just going through the average of all the points and isn’t predicting anything (aside from the average), and if it’s 1, it means it exactly goes through every data point, and exactly predicts everything. Anything in between gives you a good measure of how well the line fits the data – and most importantly, it has meaning relative to other datasets and regression lines.
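Here’s a minimal numpy version of that formula, with invented points; the two checks at the end show the extremes of 0 (a flat line at the mean) and 1 (a line through every point):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9])

slope, intercept = np.polyfit(x, y, deg=1)   # line that minimizes the RSS
line_y = slope * x + intercept

ss_res = np.sum((y - line_y) ** 2)        # residual sum of squares (leftover error)
ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares (spread around the mean)

r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")

# Sanity checks on the two extremes:
# - a flat line at the mean leaves all the variance unexplained -> R^2 = 0
flat_rss = np.sum((y - np.mean(y)) ** 2)
print(1 - flat_rss / ss_tot)              # 0.0
# - a line through every point leaves nothing unexplained (ss_res = 0) -> R^2 = 1
print(1 - 0.0 / ss_tot)                   # 1.0
```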
You can interpret it as telling you what fraction of the variance in the data is _explained_ by the regression. In other words, if I propose that a linear model fits my data, and R^2 is 1, that means there is nothing happening in that data that my line does not explain. If R^2 is zero, then it means my line explains exactly nothing in the data, and everything that is happening comes from _something else_ that isn’t captured by that line. If it’s 0.5, it means roughly half of the variance is explained by the line, and half comes from something else.
That’s why, if you are proposing that a model should be linear, you want R^2 to be as high as possible. If it’s 0.8, it means 20% of what you observe isn’t explained by that model. Depending on the application, that could matter more or less, but in general it means you would expect the errors in your model’s estimates to have a variance of about 20% of the total variance in the data.
I would look for a textbook that accompanies a stats software program. Andy Field has a series of books which cover regression at an understandable level (well, as understandable as possible; this does require some stats knowledge). I know he has versions for running regressions in both R and SPSS (different books). Having the full textbook explanation would, I think, be a better resource as you figure this all out, and it gives you a direct example of how to run the analysis in the specific program that you are using.
I visited this reddit to get the Crypto crash explained to me but I saw a stats question!
Instead of super kid-friendly talk, I tried to make a more visual explanation. [https://imgur.com/a/2yUPjB2](https://imgur.com/a/2yUPjB2)
Let error be the distance between a data point and a prediction line. If we turn those distances into squares, bigger distances become bigger squares. A distance of 2 becomes a square of 4, a distance of 4 becomes a square of 16. This means that bigger distances (bigger errors) become a bigger problem. **You might think of the error lines in the picture as springs: more stretched-out springs pull even harder on the line.** So that’s error and squared error. Now let’s add up all the squared errors and call that the “variance” (the total amount of spring pull).
When we fit a line-of-best-fit model, there is a default line that is always available: a flat line at the average value. Let’s compute that line’s variance and call it the “total variance”. Now let’s plug in our line-of-best-fit model and compute its variance. That is the leftover error that we have not explained.
We can ask how much the total variance changed. The variance explained is the total variance minus the leftover variance. Now compute the fraction explained variance / total variance. That is the proportion of variance explained, R-squared. It’s how much of the total spring pull in the first picture disappeared when we use the line in the second picture.
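For what it’s worth, here’s that spring-pull bookkeeping as a short Python sketch (the data points are invented): total variance around the mean line, leftover variance around the fitted line, and the proportion that disappeared.

```python
import numpy as np

# Invented data points just to make the "spring pull" numbers concrete
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.0, 2.7, 3.1, 4.4, 4.6, 5.5])

# Default model: a flat line at the average value
mean_line = np.full_like(y, np.mean(y))
total_variance = np.sum((y - mean_line) ** 2)     # total spring pull

# Line-of-best-fit model
slope, intercept = np.polyfit(x, y, deg=1)
fit_line = slope * x + intercept
leftover_variance = np.sum((y - fit_line) ** 2)   # pull that remains after the fit

explained = total_variance - leftover_variance    # how much pull disappeared
r_squared = explained / total_variance
print(f"proportion of variance explained (R-squared): {r_squared:.2f}")
```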
Not at all what OP was asking, but my favourite thing about regression is where the name came from. The term was coined by [Francis Galton in 1886](https://www.jstor.org/stable/2841583) to describe the tendency for tall people to have more average-height descendants. He used a line fitted through correlated data to show this. He called the phenomenon “regression toward mediocrity” (!) and the name stuck to the analysis he used to show it. Hence, regression analysis.