# Machine learning: Simple and multiple linear regression models in Python

I've described machine learning in more detail in a previous article. Today, I want to run through a simple machine learning model that uses linear regression.

### What is regression?

Regression aims to predict the numeric value of something, given a set of input parameters. For example, we could approximate the price of a car, given its mileage, age, brand, MOT status, etc. In this simple example, we're going to predict the output value based on three randomly generated input variables. In a real-world example, those variables could be mileage, age and miles since the last service.

To put linear regression simply, it's about drawing a line of best fit through the data points on a graph. So in this example, if the line of best fit were y = 2x and X is 3, then we would expect Y to be 6.
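That line of best fit can be produced in a few lines with scikit-learn. Here's a minimal sketch, using made-up data that follows an exact y = 2x relationship:

```python
from sklearn.linear_model import LinearRegression

# A handful of points that lie exactly on the line y = 2x.
# scikit-learn expects the inputs as a 2D structure (one row per sample).
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# Fit a line of best fit through the points
model = LinearRegression().fit(X, y)

# For an input of 3, the fitted line predicts 6
print(model.predict([[3]]))  # -> [6.]
```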

### The process

The process that this model follows is:

1. Create some training input and output data
2. Feed that data into the model so that it can fit a line to it
3. Test that the model works

Of course, in this example, we're creating the output data knowing the exact relationship (i.e. output = a + b + (100 * c)). Hence, we expect the fitted coefficients to be 1, 1 and 100. In the 'real world' we would not have such a direct relationship, so the output data would serve to train the model (rather than tell it what we already know).

### The code

```python
# Firstly, we need to import randint, to generate random numbers as our input data
from random import randint

# Next, import the linear regression model from the scikit-learn library
from sklearn.linear_model import LinearRegression

# Set a limit, so random values cannot exceed 1000
training_limit = 1000

# We're going to create 100 training samples
vals = 100

# Create a list to store the inputs & another to store the outputs
training_data_in = list()
training_data_out = list()

# Now build the dataset. We need 100 samples (as defined in the variable above),
# so the loop runs 100 times, creating a random value for a, b and c each time.
# randint produces a number between 0 and 1000 (the value of training_limit)
for i in range(vals):
    a = randint(0, training_limit)
    b = randint(0, training_limit)
    c = randint(0, training_limit)

    # Store the values of a, b and c in our input list
    training_data_in.append([a, b, c])

    # Run our calculation. In this example, the output is equal to a + b + (100 * c)
    output_calc = a + b + (100 * c)

    # Add each computed figure to our output list
    training_data_out.append(output_calc)

# Define the model as a linear regression model. n_jobs can be an integer or
# None (the default); it's the number of jobs to use for the calculation, and
# -1 means use all processors
model = LinearRegression(n_jobs=-1)

# Populate the model with the input and output data that we generated above
model.fit(X=training_data_in, y=training_data_out)

# Now, let's test our model. I'm going to feed the values 5, 10 and 15 into it
test_data = [[5, 10, 15]]
outcome = model.predict(X=test_data)

# Retrieve the coefficients
coefficients = model.coef_

# Print our results, based on our test data of 5, 10, 15
print('predicted value:')
print(outcome)
print(' ')
print('coefficients:')
print(coefficients)
```
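Because the training data follows the exact relationship output = a + b + (100 * c), we can sanity-check the result: the coefficients should come out at (almost exactly) 1, 1 and 100, and the prediction for [5, 10, 15] should be very close to 5 + 10 + (100 * 15) = 1515. A condensed sketch of that check:

```python
from random import randint
from sklearn.linear_model import LinearRegression

# Rebuild the noise-free training set: output = a + b + (100 * c)
training_data_in, training_data_out = [], []
for _ in range(100):
    a, b, c = randint(0, 1000), randint(0, 1000), randint(0, 1000)
    training_data_in.append([a, b, c])
    training_data_out.append(a + b + (100 * c))

model = LinearRegression(n_jobs=-1).fit(training_data_in, training_data_out)

# With noise-free data, the model recovers the exact relationship
print(model.coef_)                    # -> approximately [1., 1., 100.]
print(model.predict([[5, 10, 15]]))  # -> approximately [1515.]
```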

### So, what about bringing my own data in (rather than randomly generating it)?

Below, I've done exactly that, using Pandas to read my CSV in, and Matplotlib to show the linear regression plot.

```In [1]: import pandas as pd
#
#Read the CSV into a dataframe, then take a look at it
#(the column names below are assumed from context)
In [4]: df
Out[4]:
   Hours Studied  Grade
0              1     10
1              2     20
2              3     30
3              4     40
4              5     50
5              6     60
6              7     70
7              8     80
8              9     90
9             10    100
#
#X is the input column of the dataframe & y is the output column
In [5]: X = df[['Hours Studied']]
In [6]: y = df['Grade']
#
In [7]: from sklearn.linear_model import LinearRegression
In [8]: model = LinearRegression().fit(X, y)
#
#Now we can test the model, based on 5 hours of study time. The output is correct - 50% would be the expected grade.
In [9]: model.predict([[5]])
Out[9]: array([50.])
#
#Now, let's plot this!
In [10]: import matplotlib.pyplot as plt
In [11]: plt.scatter(X, y)
In [12]: plt.show()```
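To make that plot more useful, you can draw the fitted line over the scatter of raw points. A sketch of that, with the CSV data inlined as a dataframe (the column names are assumed from the transcript above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Inline the same data as the CSV: 1-10 hours studied, grades 10-100
df = pd.DataFrame({'Hours Studied': range(1, 11),
                   'Grade': range(10, 101, 10)})
X = df[['Hours Studied']]
y = df['Grade']

model = LinearRegression().fit(X, y)

# Scatter the raw observations, then overlay the model's fitted line
plt.scatter(X, y, label='observations')
plt.plot(X, model.predict(X), color='red', label='line of best fit')
plt.xlabel('Hours Studied')
plt.ylabel('Grade')
plt.legend()
plt.show()
```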

## What about multiple linear regression?

In the example below, I've added GPA (Grade Point Average) to the CSV, which gives us two variables to feed into the model. The first is the number of hours each student studied for the exam, and the second is their GPA across all other subjects. As you can see, the model adjusts its prediction to factor this in.

```#Set a path for the CSV and create a dataframe
#Take a look at the dataframe you created
#(the column names below are assumed from context)
In [3]: df
Out[3]:
   Hours_Studied  GPA  Grade
0              1   40     40
1              2   60     50
2              3   66     60
3              4   55     70
4              5   59     80
5              6   67     85
6              7   81     77
7              8   86     90
8              9   71     67
9             10   83     91
#Define x as having multiple input variables
In [4]: x = df[['Hours_Studied','GPA']]
#Inspect x
In [5]: x
Out[5]:
   Hours_Studied  GPA
0              1   40
1              2   60
2              3   66
3              4   55
4              5   59
5              6   67
6              7   81
7              8   86
8              9   71
9             10   83
#Define y as the truth (or the output)
In [6]: y = df[['Grade']]
#Import the linear regression model and make the prediction
In [7]: from sklearn.linear_model import LinearRegression
In [8]: model = LinearRegression().fit(x, y)
In [9]: model.predict([[5, 47]])
Out[9]: array([[62.09167157]])```
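To see how much each input contributes to that prediction, you can inspect the fitted model's `coef_` and `intercept_` attributes. A sketch, with the data reconstructed from the tables above (column names assumed):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The same data as the CSV transcript above
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'GPA':           [40, 60, 66, 55, 59, 67, 81, 86, 71, 83],
    'Grade':         [40, 50, 60, 70, 80, 85, 77, 90, 67, 91],
})
x = df[['Hours_Studied', 'GPA']]
y = df[['Grade']]

model = LinearRegression().fit(x, y)

# One coefficient per input feature: how much the predicted grade moves
# for one extra hour of study, and for one extra GPA point
print(model.coef_)
print(model.intercept_)
print(model.predict([[5, 47]]))  # -> approximately [[62.09]]
```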

## Checking model accuracy

You saw in the previous post on this topic that to train a model, we present it with all of the observations and the truth (the actual outcome). This is so it can build a model to make future predictions with.

In order to test our model, we need to split our data into four sets:

• Training data: observations
• Training data: results/labels
• Testing data: observations
• Testing data: results/labels

We then build the model using the two training datasets and test it by:

1. Running the testing data observations through the model's prediction
2. Calculating the degree of error between the predicted results and the actual results/labels
3. Calculating a model accuracy percentage

The below script outlines how to do this:

```In [1]: path = 'Desktop/grade.csv'
In [2]: df = pd.read_csv(path)
In [3]: x = df[['Hours_Studied','GPA']]
In [4]: y = df[['Grade']]
In [5]: from sklearn.model_selection import train_test_split
#Split your dataset into features and labels for training and then validating/testing the model
In [6]: train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)
In [7]: print(train_features.shape)
   ...: print(train_labels.shape)
   ...: print(test_features.shape)
   ...: print(test_labels.shape)
(7, 2)
(7, 1)
(3, 2)
(3, 1)
In [8]: from sklearn.linear_model import LinearRegression
#Present scikit-learn with the features and the truth so it can create a model
In [9]: model = LinearRegression().fit(train_features, train_labels)
In [10]: model.predict([[5, 47]])
Out[10]: array([[69.94198545]])
#Run the model again, this time on the test data
In [11]: predictions = model.predict(test_features)
#Calculate the error margin by taking the predicted value & subtracting the 'truth'
In [12]: errors = abs(predictions - test_labels)
In [13]: errors
Out[13]:
       Grade
8  24.589279
1   4.253585
5   9.399127
#We can therefore work out the % error, by taking errors/actuals * 100
In [14]: percent_error = 100 * (errors / test_labels)
In [15]: percent_error
Out[15]:
       Grade
8  36.700416
1   8.507170
5  11.057796```
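One common way to finish step 3 is to average those percentage errors into a mean absolute percentage error (MAPE) and report 100 - MAPE as the accuracy. A sketch of that final step, with the data reconstructed from the tables earlier in the post (column names assumed):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The same data as the grade.csv transcripts above
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'GPA':           [40, 60, 66, 55, 59, 67, 81, 86, 71, 83],
    'Grade':         [40, 50, 60, 70, 80, 85, 77, 90, 67, 91],
})
x = df[['Hours_Studied', 'GPA']]
y = df[['Grade']]

# Hold back 25% of the rows for testing
train_features, test_features, train_labels, test_labels = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Train on the training rows only, then predict the held-back rows
model = LinearRegression().fit(train_features, train_labels)
predictions = model.predict(test_features)

# Per-row absolute error and percentage error, as in the transcript above
errors = abs(predictions - test_labels)
percent_error = 100 * (errors / test_labels)

# Average the percentage errors into a MAPE, then report 100 - MAPE
mape = float(percent_error.to_numpy().mean())
accuracy = 100 - mape
print(f'accuracy: {accuracy:.2f}%')
```

With only ten rows, the result swings a lot depending on which three rows land in the test set, which is why a larger dataset (or cross-validation) gives a more trustworthy accuracy figure.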