# Machine learning: A simple linear regression model in Python

Machine learning is described in detail in this article. Today, I want to run through a simple machine learning model that uses linear regression.

### What is regression?

Regression aims to predict the numeric value of something, given a set of input parameters. For example, we could approximate the price of a car, given its mileage, age, brand, MOT status, etc. In this simple example, we’re going to predict an output value based on three randomly generated input variables. In a real-world example, those variables could be mileage, age and miles since last service.

To put linear regression simply, it’s about creating a line of best fit on a graph and then reading predictions off that line. So in this example, if X is 3 and the line passes through Y = 6 at that point, then we would expect Y to be 6.
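As a rough illustration of what "line of best fit" means, here is a minimal sketch in plain Python. It assumes, purely for illustration, four made-up points that lie on the line y = 2x; the data and the `predict` helper are my own assumptions, not part of the model built later in this article.

```python
# Made-up points that lie exactly on the assumed line y = 2x
xs = [1.0, 2.0, 4.0, 5.0]
ys = [2.0, 4.0, 8.0, 10.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Ordinary least squares for a single input variable:
# slope = covariance(x, y) / variance(x)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x


def predict(x):
    """Read a prediction off the fitted line (hypothetical helper)."""
    return intercept + slope * x


print(predict(3.0))  # prints 6.0
```

With real data the points would not sit exactly on the line, and the same formula would give the line that minimises the squared vertical distances to the points.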

### The process

The process that this model follows is:

1. Create some training input and output data
2. Feed that data into the model so that it can fit a line to it
3. Test that the model works

Of course, in this example, we’re creating the output data ourselves, knowing the exact relationship (i.e. output = a + b + (100 * c)). Hence, we expect the coefficients for a, b and c to be 1, 1 and 100. In the ‘real world’ we would not have such a direct relationship, so the output data would serve to train the model (rather than tell it what we already know).

### The code

```
# Firstly, we need to import the library to allow us to generate random numbers as our input data.
from random import randint
#
# next, we're going to need to import the python library required for linear regression
from sklearn.linear_model import LinearRegression
#
# then, we set an upper limit, so the randomly generated inputs cannot exceed 1000 in value
training_limit = 1000
#
# next, we set the number of training rows to generate (100)
vals = 100
#
# Here, we're creating one list to store the inputs and another to store the outputs used to train the model
training_data_in = list()
training_data_out = list()
#
# Then, we're going to create a dataset. We need 100 rows (as defined in the variable above), so the loop runs 100 times and creates a random value for a, b and c each time.
# Below, the randint function creates a whole number between 0 and 1,000 inclusive (the value of the training_limit variable)
for i in range(vals):
    a = randint(0, training_limit)
    b = randint(0, training_limit)
    c = randint(0, training_limit)
    #
    # Now, we're going to store the values of a, b and c in our input list.
    training_data_in.append([a, b, c])
    #
    # next, we need to run our calculation. In this example, the output is equal to the value of a + b + (100 * c)
    output_calc = a + b + (100 * c)
    #
    # We're now going to add each of those computed figures to our output list.
    training_data_out.append(output_calc)
#
# So now, we define the model as being a linear regression model where n_jobs = -1.
# n_jobs can be an integer or None. The default is None. This is the number of jobs to use for the calculation. -1 means we are using all processors for the calculation.
model = LinearRegression(n_jobs=-1)
#
# Now we need to fit the model to the input and output data that we generated above
model.fit(X=training_data_in, y=training_data_out)
#
# Now, let's test our model. My test for X (i.e. the input data) is below. I'm going to pass 5, 10 and 15 into the model, so we expect a prediction of about 5 + 10 + (100 * 15) = 1515
test_data = [[5, 10, 15]]
#
# Then, we pass those test values into our model
outcome = model.predict(X=test_data)
#
# Retrieve the coefficients that were calculated when the model was fitted
coefficients = model.coef_
#
# Print our results, based on our test data of 5, 10, 15
print('predicted value:')
print(outcome)
print(' ')
print('coefficients:')
print(coefficients)
```
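A single test row is a fairly thin check, so here is a hedged sketch of how step 3 ("test that the model works") could be extended. It generates a separate batch of unseen rows with the same formula and compares the predictions against the known answers. The `make_rows` helper, the fixed seed and the 20-row test size are my own assumptions for illustration.

```python
import random

from sklearn.linear_model import LinearRegression

random.seed(0)  # fixed seed (an assumption) so the sketch is repeatable


def make_rows(n, limit=1000):
    """Hypothetical helper: n rows of [a, b, c] plus outputs a + b + 100 * c."""
    inputs, outputs = [], []
    for _ in range(n):
        a, b, c = (random.randint(0, limit) for _ in range(3))
        inputs.append([a, b, c])
        outputs.append(a + b + 100 * c)
    return inputs, outputs


# Train on 100 rows, then test on 20 rows the model has not seen
train_in, train_out = make_rows(100)
test_in, test_out = make_rows(20)

model = LinearRegression().fit(train_in, train_out)
predictions = model.predict(test_in)

# Largest gap between a prediction and the known correct output
max_error = max(abs(p - t) for p, t in zip(predictions, test_out))

print("coefficients:", model.coef_.round(6))
print("intercept:", round(float(model.intercept_), 6))
print("largest absolute error:", max_error)
```

Because the outputs here follow the formula exactly, the coefficients should come out very close to 1, 1 and 100 and the error should be tiny. With noisy real-world data you would expect a visible error, and a held-out test set like this is how you would measure it.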

### So, what about bringing my own data in (rather than randomly generating it)?

Below, I’ve done exactly that, using pandas to read my CSV in and Matplotlib to plot the data.

```
In [1]: import pandas as pd
In [4]: df
Out[4]:
0              1     10
1              2     20
2              3     30
3              4     40
4              5     50
5              6     60
6              7     70
7              8     80
8              9     90
9             10    100
#
# X and y each hold the values of a single column of the dataframe. X uses double brackets so that it stays two-dimensional, which is what scikit-learn expects
In [5]: X = df[['Hours Studied']]
#
In [7]: from sklearn.linear_model import LinearRegression
In [8]: model = LinearRegression().fit(X, y)
#
# Now we can test the model, based on 5 hours of study time. The output is correct: 50% would be the expected grade.
In [9]: model.predict([[5]])
Out[9]: array([50.])
#
#Now, let's plot this!
In [10]: import matplotlib.pyplot as plt
In [11]: plt.scatter(X,y)
Out[12]: <matplotlib.collections.PathCollection at 0x1a18eed5c0>
In [13]: plt.show()
```
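If you'd like to run this end to end as a script, here is a minimal, self-contained sketch of the same idea. It is written under some assumptions: the CSV is held in a string rather than a file on disk, and the second column is called `Grade`, which is a name I've made up for illustration (only `Hours Studied` appears in the session above).

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for a CSV file on disk; 'Grade' is an assumed column name
csv_text = """Hours Studied,Grade
1,10
2,20
3,30
4,40
5,50
6,60
7,70
8,80
9,90
10,100
"""

df = pd.read_csv(io.StringIO(csv_text))

# Double brackets keep X as a two-dimensional DataFrame; y is a single column
X = df[["Hours Studied"]]
y = df["Grade"]

model = LinearRegression().fit(X, y)

# Predict using a DataFrame with the same column name the model was fitted on
new_hours = pd.DataFrame({"Hours Studied": [5]})
predicted_grade = model.predict(new_hours)[0]

print("predicted grade for 5 hours:", round(predicted_grade, 2))
print("slope:", round(model.coef_[0], 2))
```

To use your own file, you would pass its path to `pd.read_csv` in place of the `io.StringIO` object and swap in your own column names.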