Machine learning: Simple and multiple linear regression models in Python


Machine learning is described in detail in this article. Today, I want to run through a simple machine learning model that uses linear regression.

What is regression?

Regression aims to predict the numeric value of something, given a set of input parameters. For example, we could approximate the price of a car, given its mileage, age, brand, MOT status, etc. In this simple example, we’re going to predict the output value based on three randomly generated input variables. In a real-world example, those variables could be mileage, age and miles since last service.

To put linear regression simply, it’s about creating a line of best fit on a graph. So in this example, if X is 3, then we would expect Y to be 6.
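
To make the line-of-best-fit idea concrete, here is a minimal sketch using NumPy. The data points are hypothetical and are assumed to lie exactly on the line Y = 2X, which is the relationship implied by "X is 3, Y is 6"; real data would be noisier.

```python
import numpy as np

# Hypothetical points lying exactly on the line Y = 2X (an assumption for illustration).
x = np.array([1.0, 2.0, 4.0, 5.0])
y = 2.0 * x

# Fit a degree-1 polynomial, i.e. a straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict Y for an X value that was not in the data.
predicted = slope * 3 + intercept
print(round(float(predicted), 2))
```

Because X = 3 was never in the input data, the value printed comes purely from the fitted line and should be roughly 6.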

The process

The process that this model follows is:

  1. Create some training input and output data
  2. Inject that data into the model so that it can fit a line to it
  3. Test that the model works

Of course, in this example, we’re creating the output data ourselves, knowing the exact relationship (i.e. output = a + b + (100*c)). Hence, we expect the fitted coefficients to be 1, 1 and 100. In the ‘real world’ we would not have such a direct relationship, so the output data would serve to train the model (rather than tell it what we already know).

The code
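
The three steps above can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn's `LinearRegression`; the value range, sample count and seed are arbitrary choices and are not taken from the original script.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# 1. Create training data: three random inputs (a, b, c) per row,
#    with the output built from the known rule a + b + 100*c.
X_train = rng.integers(0, 1000, size=(100, 3)).astype(float)
y_train = X_train[:, 0] + X_train[:, 1] + 100 * X_train[:, 2]

# 2. Fit the model to the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# 3. Test the model on an input it has not seen: 10 + 20 + 100*30 = 3030.
X_test = np.array([[10.0, 20.0, 30.0]])
prediction = model.predict(X_test)

print("Coefficients:", np.round(model.coef_, 3))
print("Prediction:", np.round(prediction, 3))
```

Since the output was generated from an exact linear rule, the coefficients should come back as approximately 1, 1 and 100.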

So, what about bringing my own data in (rather than randomly generating it)?

Below, I’ve done exactly that, using pandas to read my CSV in and Matplotlib to show the linear regression plot.
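
A minimal sketch of that approach is shown here. The inline CSV text, its column names (`hours_studied`, `exam_score`) and its values are all hypothetical stand-ins for the real file, and the model is again assumed to be scikit-learn's `LinearRegression`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for the real CSV file; with a file on disk
# this would be pd.read_csv("your_file.csv") instead.
csv_text = """hours_studied,exam_score
1,52
2,55
3,61
4,64
5,70
6,73
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df[["hours_studied"]]   # 2-D input: one column of observations
y = df["exam_score"]        # 1-D output: the value to predict

model = LinearRegression().fit(X, y)

# Plot the raw points and the fitted line over them.
plt.scatter(df["hours_studied"], y, label="Observed")
plt.plot(df["hours_studied"], model.predict(X), color="red", label="Line of best fit")
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.legend()
plt.savefig("regression_plot.png")
```

The double brackets in `df[["hours_studied"]]` matter: scikit-learn expects the inputs as a two-dimensional table, even when there is only one column.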

What about multiple linear regression?

In the example below, I have added GPA (Grade Point Average) to the CSV. This now gives us two variables to input into the model. The first is the number of hours that the students studied for the exam, and the second is their GPA across all other subjects. As you can see, the model has adjusted the prediction to factor this in.
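
As a rough sketch of the two-variable case, the only real change is passing two columns to the model instead of one. The data and column names here are hypothetical, and scikit-learn's `LinearRegression` is assumed.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied, GPA and the resulting exam score.
csv_text = """hours_studied,gpa,exam_score
1,2.1,50
2,3.5,62
3,2.4,58
4,3.8,71
5,2.9,68
6,3.6,78
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df[["hours_studied", "gpa"]]  # two input variables this time
y = df["exam_score"]

model = LinearRegression().fit(X, y)

# One coefficient per input column, in the same order as the columns of X.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")

# Predict for a hypothetical student: 4 hours of study and a GPA of 3.0.
new_student = pd.DataFrame({"hours_studied": [4], "gpa": [3.0]})
predicted_score = model.predict(new_student)[0]
print(f"Predicted score: {predicted_score:.1f}")
```

Each coefficient describes how much the predicted score changes when that input goes up by one unit while the other input stays fixed.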

Checking model accuracy:

You saw in the previous post on this topic that, to train a model, we present it with all of the observations and the truth (the actual outcome). This lets it learn the relationship between the two, so that it can make future predictions.

In order to test our model, we need to split our data into four parts:

  • Training data: observations
  • Training data: results/labels
  • Testing data: observations
  • Testing data: results/labels
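
One way to produce these four pieces is scikit-learn's `train_test_split`, sketched here on hypothetical data. The 80/20 ratio and the seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 input columns each, plus 10 labels.
observations = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# test_size=0.2 holds back 20% of the rows for testing (an assumed ratio).
train_obs, test_obs, train_labels, test_labels = train_test_split(
    observations, labels, test_size=0.2, random_state=42
)

print(train_obs.shape, train_labels.shape)  # training observations and labels
print(test_obs.shape, test_labels.shape)    # testing observations and labels
```

The rows are shuffled before splitting, but each observation stays paired with its own label.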

We then build the model using the two training datasets and test it as follows:

  1. Run the testing data observations through the model prediction
  2. Calculate the degree of error between the predicted results and the actual results/labels
  3. Calculate model accuracy percentage

The script below outlines how to do this:
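
As a minimal sketch of those steps, assuming scikit-learn throughout: the data is hypothetical (a known linear relationship plus random noise), and the "accuracy percentage" is taken here to be R squared multiplied by 100. That is one common choice for regression, and other measures could equally be used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=1)

# Hypothetical data: a linear relationship with some random noise added.
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, size=200)

# Split into the four parts: training/testing observations and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Build the model from the training data only.
model = LinearRegression().fit(X_train, y_train)

# 1. Run the testing observations through the model.
y_pred = model.predict(X_test)

# 2. Measure the error between the predictions and the actual labels.
mae = mean_absolute_error(y_test, y_pred)

# 3. Express the quality of fit as a percentage, here using R squared.
accuracy_pct = r2_score(y_test, y_pred) * 100

print(f"Mean absolute error: {mae:.2f}")
print(f"R squared: {accuracy_pct:.1f}%")
```

The model never sees the testing labels during training, so the scores reflect how well it handles data it has not encountered before.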