Linear Regression


Linear regression provides a rule that enables us to predict Y from X. Effectively, it fits a line of best fit to a scatter plot: the line for which the sum of the squared vertical distances between the datapoints and the line is smallest.
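To make the "least squares" idea concrete, here is a minimal sketch that scores a few candidate lines by their sum of squared distances. The data is made up for illustration (it is not the dataset used in this article), and the name sum_of_squares is my own; a is the intercept and b is the slope of each candidate line.

```python
def sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances between each point and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = sum_of_squares(xs, ys, a=2.2, b=0.6)     # least-squares line for this data
shifted = sum_of_squares(xs, ys, a=2.0, b=0.6)  # same slope, lower intercept
steeper = sum_of_squares(xs, ys, a=2.2, b=0.7)  # same intercept, steeper slope

print(best < shifted and best < steeper)  # True
```

Any other line we try for this data should score worse than the least-squares one, which is what "best fit" means here.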

The equation for that line is:

Y = a + bX

where b is the slope of the line and a is the Y intercept.

We need to calculate b and a. As a worked example, let’s use the same dataset as in a previous article:

Using this data, we can calculate b. The formula looks a little scary, but I’ve worked through it step by step below:
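As a rough sketch in code, one standard way to write the least-squares slope is the sum of (X - mean of X) × (Y - mean of Y), divided by the sum of (X - mean of X) squared. This may be laid out differently from the formula used in this article, but it gives the same value of b. The numbers below are made up for illustration, and the function name slope is my own.

```python
def slope(xs, ys):
    """Least-squares slope: sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How X and Y vary together around their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How X varies around its own mean
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data, not the article's dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b = slope(xs, ys)
print(round(b, 2))  # 0.6
```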

So now we know b: it’s 1.07. Next we work out a, which depends on b and so must be calculated second. a is the Y intercept, the value of Y that the line predicts when X is zero. Imagine we were modelling the effect of education on earning potential: the Y intercept would be the minimum wage it’s possible to earn, so it would not be zero.
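For a least-squares line, the intercept is usually computed as a = (mean of Y) - b × (mean of X), which is why b has to come first. Here is a minimal sketch of that step, again on made-up numbers rather than the article's dataset; the function name intercept is my own.

```python
def intercept(xs, ys, b):
    """Least-squares intercept: mean of Y minus slope times mean of X."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return mean_y - b * mean_x

# Hypothetical data, not the article's dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b = 0.6  # least-squares slope for this hypothetical data
a = intercept(xs, ys, b)
print(round(a, 2))  # 2.2
```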

So, b is 1.07 and a is -83.3. We can now substitute these into our original formula.

So, if X were 114, Y would be 38.7 (-83.3 + 1.07 × 114 = 38.68, which rounds to 38.7).
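The same prediction can be checked in a couple of lines of code. This is just a sketch using the values of a and b worked out above; the function name predict is my own.

```python
A = -83.3  # Y intercept from the worked example
B = 1.07   # slope from the worked example

def predict(x, a=A, b=B):
    """Predicted Y for a given X on the line Y = a + bX."""
    return a + b * x

print(round(predict(114), 1))  # 38.7
```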

A residual is the difference between the actual outcome and the value the line predicts for it. Large residuals (whether positive or negative) mean that the line doesn’t fit the data very well.
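Here is a small sketch of how residuals could be computed, taking each residual as actual Y minus predicted Y (the usual sign convention). The data and the line (a = 2.2, b = 0.6) are made up for illustration and are not the article's numbers.

```python
def residuals(xs, ys, a, b):
    """Actual Y minus predicted Y for every datapoint."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data and its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

res = residuals(xs, ys, a=2.2, b=0.6)
print([round(e, 1) for e in res])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

Points above the line get a positive residual and points below it get a negative one, so it is the size of the residuals, not their sign, that tells us how well the line fits.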

Coefficient of determination (r squared)

R squared, also known as the coefficient of determination, tells us how well the line fits the data. It denotes the proportion of the variance in the dependent variable (Y) that is predictable from the independent variable (X). In other words, if X increases by 4, how predictable is the impact on Y?

An r squared of zero means that we cannot predict Y from X.

An r squared of 1 means we can predict Y from X with no error.

We have previously calculated r here. To get r squared, we simply take this value and square it. Squaring makes a negative value positive and gives a number between 0 and 1, which can be read as a percentage.

If r is 0.95 then r squared is about 0.90 (90%). This means that about 90% of the variance in Y is predictable from X.
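As a quick sketch, here is the squaring step in code, first for the r = 0.95 example and then for a correlation computed from scratch. The dataset is made up for illustration, and pearson_r is my own helper name rather than anything from a library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# The example from the text: r = 0.95
print(round(0.95 ** 2, 4))  # 0.9025

# Hypothetical data, not the article's dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2
print(round(r_squared, 2))  # 0.6
```

For this made-up data, about 60% of the variance in Y is predictable from X.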

We can use r squared to compare models fitted to the same data. If r squared is low, adding more input variables may help the model explain more of the variance in Y.