Regression vs Correlation


Correlation quantifies the degree to which two variables are related. It does not fit a line through the data points, so on its own it cannot be used for prediction. The correlation coefficient (r) ranges from -1 to +1 and tells us how strongly, and in which direction, two variables move together. If r is equal to zero, there is no linear relationship. When r is positive, one variable tends to increase as the other increases; when it’s negative, one variable tends to go down as the other goes up.

With correlation, it doesn’t matter which variable is X and which is Y. If you swap them around, the correlation coefficient (r) will remain the same.
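As a rough illustration of this symmetry, the sketch below computes the Pearson correlation coefficient by hand and swaps the arguments. The data values and the names pearson_r, years and salary are made up for this example and are not taken from any real dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: additional years of education and salary in pounds.
years = [0, 1, 2, 3, 4, 5]
salary = [10500, 12000, 15500, 17000, 20500, 22000]

r_xy = pearson_r(years, salary)   # treat years as X, salary as Y
r_yx = pearson_r(salary, years)   # swap the roles of X and Y
print(r_xy, r_yx)                 # the two values are the same
```

Because the formula treats both variables identically, swapping the arguments leaves r unchanged.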

Regression is typically used when one variable is thought to depend on another (cause & effect), and it does fit a line of best fit through the data. The underlying model can be written as y = b0 + b1x + ε, where ε (epsilon) is the measure of how far above or below the true regression line the actual observation of y lies.

In regression, it does matter which variable is X and which is Y. If you swap the two, you will get a different line of best fit. The line that best predicts Y from X is not the same as that which predicts X from Y.
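The asymmetry can be seen in a small sketch, assuming an ordinary least-squares fit. The helper fit_line and the data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of products of deviations over sum of squared x deviations.
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: additional years of education and salary in pounds.
years = [0, 1, 2, 3, 4, 5]
salary = [10500, 12000, 15500, 17000, 20500, 22000]

b0_yx, b1_yx = fit_line(years, salary)   # line predicting salary from years
b0_xy, b1_xy = fit_line(salary, years)   # line predicting years from salary

# Rearranging years = b0_xy + b1_xy * salary into "salary per year" form
# gives a slope of 1 / b1_xy, which differs from b1_yx unless r is exactly +1 or -1.
slope_from_reverse_fit = 1 / b1_xy
print(b1_yx, slope_from_reverse_fit)
```

The two slopes only coincide when the points lie exactly on a straight line; with any scatter, the two fits give different lines.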

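The fitted regression line is calculated using the formula ŷ = b0 + b1x. The variable definitions are below:

y = the variable we are trying to predict (dependent variable)

x = independent variable

b0 = constant, the predicted value when x is zero (like minimum wage)

b1 = slope, which quantifies the effect of x on y (the change in ŷ for each one-unit increase in x)

ŷ = estimated / predicted value of y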

If minimum wage were £10,000, then b0 would be equal to £10,000, as that is the constant (the predicted salary with no additional education). If we determine that each year of additional education adds £2,500 to someone’s salary expectation, then b1 = 2,500 and the formula becomes: ŷ = 10000 + 2500x

In this example, x is the number of additional years spent in education. If I have spent 3 additional years in education, I would expect ŷ to be 10000 + 2500 × 3 = £17,500.
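The worked example above can be written as a short function. The name predict_salary is made up for illustration; the default values of b0 and b1 are the figures from the example, not estimates from real data.

```python
def predict_salary(extra_years, b0=10_000, b1=2_500):
    """Predicted salary in pounds: y_hat = b0 + b1 * x."""
    return b0 + b1 * extra_years

prediction = predict_salary(3)   # 3 additional years in education
print(prediction)                # 17500
```

With zero additional years the function returns the constant b0, which matches the minimum-wage interpretation of the intercept.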

So, in the above, the regression example gave us the ability to predict / estimate, whereas correlation would only give us an indication of how strongly the two variables are related.