TechTarget defines correlation as “a statistical measure that indicates the extent to which two or more variables fluctuate together”. There are two major components to correlation: its type and its strength.
Let’s look at some types:
- Positive correlation means that when X goes up, Y goes up
- Negative correlation means that when X goes up, Y goes down
- None simply means there is no correlation
- A perfect correlation almost never happens. It’s when the dots fall exactly on the line and there is a perfect relationship between X and Y.
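To make the types concrete, here is a minimal Python sketch that labels the direction of a relationship. The function name `direction` and the tiny datasets are made up for illustration. It only looks at whether X and Y tend to sit on the same side of their averages, which is a simplification of the full calculation.

```python
def direction(xs, ys, tol=1e-12):
    """Label the direction of a linear relationship as positive, negative or none."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of (x - mean_x) * (y - mean_y): positive when X and Y tend to be
    # above (or below) their averages together, negative when they move apart.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if co_deviation > tol:
        return "positive"
    if co_deviation < -tol:
        return "negative"
    return "none"


# Illustrative data only (not from a real dataset)
print(direction([1, 2, 3, 4], [2, 4, 5, 8]))  # Y rises with X
print(direction([1, 2, 3, 4], [8, 5, 4, 2]))  # Y falls as X rises
print(direction([1, 2, 3, 4], [1, 2, 2, 1]))  # no overall linear trend
```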
Now that we know the types of correlation, let’s look at the strengths:
- Strong means the dots on the scatterplot are all very close to the line of best fit
- Moderate means the dots are fairly close to the line
- Weak means the dots are widely scattered around the line
So, this is all great, but it’s all a bit subjective. I might think a correlation is moderate, while someone else believes it’s strong. We can quantify this with some statistics!
We use the correlation coefficient to do this – it’s denoted as r. r quantifies how strongly X and Y are correlated. It ranges from -1 (a perfect negative correlation) through 0 (no correlation) to +1 (a perfect positive correlation).
When we look at the score, we can use the below table to determine how strong the relationship is:
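As a rough sketch of how such a lookup works, the Python below turns a value of r into a label. The cut-offs (0.2, 0.4, 0.6, 0.8) follow one commonly quoted rule of thumb. They are an assumption for illustration and may differ from the bands in the table, and `describe_r` is a made-up helper name.

```python
# Assumed cut-offs for |r|: one common rule of thumb, not a universal standard.
CUTOFFS = [
    (0.2, "very weak"),
    (0.4, "weak"),
    (0.6, "moderate"),
    (0.8, "strong"),
]


def describe_r(r):
    """Turn a correlation coefficient into a label such as 'strong negative'."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)
    if size == 0:
        return "none"
    for upper, label in CUTOFFS:
        if size < upper:
            strength = label
            break
    else:
        # No cut-off matched, so |r| is 0.8 or higher
        strength = "very strong"
    sign = "positive" if r > 0 else "negative"
    return f"{strength} {sign}"


print(describe_r(0.94))  # very strong positive
print(describe_r(-0.5))  # moderate negative
```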
Now here comes the fun part: the formula. It looks much worse than it is, so let’s break it down.
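For reference, one standard way of writing the correlation coefficient (Pearson’s r) uses nothing but sums, which is why the extra columns described next are all we need. I’m assuming this computational form here, with n as the number of (X, Y) pairs:

```latex
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[\, n\sum x^{2} - \left(\sum x\right)^{2} \right]
                \left[\, n\sum y^{2} - \left(\sum y\right)^{2} \right]}}
```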
X and Y form my dataset. We need to add 3 columns:
- X squared
- Y squared
- X * Y
Once we have calculated each of these, we can sum them (as shown above). I have denoted which piece of the formula each refers to, just below the summed number.
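The same bookkeeping can be sketched in a few lines of Python. The five data points are invented purely for illustration (they are not the dataset used in this post), and `build_columns` is a hypothetical helper name.

```python
# Illustrative data only, not the dataset from this post
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def build_columns(xs, ys):
    """Add the X squared, Y squared and X * Y columns, then total every column."""
    rows = [
        {"x": x, "y": y, "x2": x * x, "y2": y * y, "xy": x * y}
        for x, y in zip(xs, ys)
    ]
    totals = {
        key: sum(row[key] for row in rows)
        for key in ("x", "y", "x2", "y2", "xy")
    }
    return rows, totals


rows, totals = build_columns(xs, ys)
for row in rows:
    print(row)
print("sums:", totals)
# sums: {'x': 15, 'y': 20, 'x2': 55, 'y2': 86, 'xy': 66}
```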
As below, we can then map those numbers into the formula.
We can then start to work out the result:
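If you would rather let the computer do the arithmetic, here is a minimal sketch that plugs the sums into the standard sum-based form of Pearson’s r. The function name `r_from_sums` and the example totals are my own (they come from a made-up five-point dataset, X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5), so the result will not match the 0.94 from this post’s data.

```python
import math


def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from the number of pairs and the five column totals."""
    numerator = n * sum_xy - sum_x * sum_y
    x_part = n * sum_x2 - sum_x ** 2
    y_part = n * sum_y2 - sum_y ** 2
    if x_part <= 0 or y_part <= 0:
        # Happens when X or Y never varies, so r is undefined
        raise ValueError("r is undefined when X or Y is constant")
    return numerator / math.sqrt(x_part * y_part)


# Totals for the made-up dataset: n=5, sum X=15, sum Y=20,
# sum X^2=55, sum Y^2=86, sum XY=66
r = r_from_sums(5, 15, 20, 55, 86, 66)
print(round(r, 2))  # 0.77
```

Working through the steps by hand for those totals: the top of the fraction is 5 × 66 − 15 × 20 = 30, and the bottom is the square root of (5 × 55 − 15²) × (5 × 86 − 20²) = 50 × 30 = 1500, which gives 30 / 38.73 ≈ 0.77.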
So, in this case, we have a very strong positive correlation of 0.94.
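Note: the correlation coefficient used here (Pearson’s r) only measures linear relationships, and the usual significance tests for it assume the data are approximately normally distributed. And remember, correlation does not imply causation.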