Analytics & Machine Learning

This section introduces a range of statistical models, then Python, and then moves on to machine learning topics.


Python has become a leading technology in the data analysis space. It’s one of the most sought-after skills & something that can help you advance your career in big data. The following series of articles covers the language and its libraries in detail, enabling you to get hands-on with Python and tackle your own use cases.

Click here to read more

Below is a quick overview of key Python functions. ‘….’ is used to show indentation.

Click here to read more

Python is great for file analysis. We’ll look at more complex functions using the Pandas library in subsequent articles, but for now, let’s look at the out-of-the-box functionality that Python provides.

Click here to read more
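
As a rough illustration of that out-of-the-box functionality, the sketch below counts lines and words using only the standard library. The sample text and the temporary file are invented so that the example is self-contained; they are not taken from the article.

```python
import os
import tempfile
from collections import Counter

# Hypothetical sample data, written to a temporary file so the sketch runs anywhere.
sample_text = "spark hive python\npython pandas\nhive python\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as handle:
    handle.write(sample_text)
    path = handle.name

line_count = 0
word_counts = Counter()

# Iterating over the file object reads one line at a time,
# so a large file never has to fit in memory all at once.
with open(path, encoding="utf-8") as source:
    for line in source:
        line_count += 1
        word_counts.update(line.split())

os.remove(path)  # clean up the temporary file

print(line_count)
print(word_counts.most_common(1))
```

Reading line by line is a deliberate choice here: it keeps memory use flat however large the file grows.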

NumPy stands for Numerical Python. It’s a library that contains a collection of routines for processing arrays.

An array is a list of values. A multi-dimensional array is essentially a grid or table: an array that contains two or more arrays.

Click here to read more
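
To make the list versus grid distinction concrete, here is a minimal sketch using NumPy. The variable names and values are made up for illustration.

```python
import numpy as np

# A one-dimensional array: effectively a list of values.
prices = np.array([10.0, 12.5, 9.0])

# A two-dimensional array: a grid built from two equal-length rows.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(prices.ndim, prices.shape)  # 1 (3,)
print(grid.ndim, grid.shape)      # 2 (2, 3)

# NumPy routines operate on whole arrays at once rather than element by element.
column_totals = grid.sum(axis=0)  # sums down each column
doubled = prices * 2              # multiplies every element by 2
```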

A Pandas Series is a one-dimensional array (a list). An example one-dimensional array is below:
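
As a minimal sketch, a Series can be built from a plain Python list. The names and scores here are invented, and the index labels are optional.

```python
import pandas as pd

# Hypothetical values; without an explicit index the labels default to 0, 1, 2, ...
scores = pd.Series([82, 91, 77], index=["alice", "bob", "carol"])

print(scores["bob"])   # look up a single value by its label
print(scores.mean())   # aggregate across the whole series

# Boolean filtering keeps only the entries that satisfy the condition.
above_80 = scores[scores > 80]
print(above_80)
```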

Sentiment analysis can provide key insight into the feelings of your customers towards your company & hence is becoming an increasingly important part of data analysis. Click here to read more

The script below shows how we might handle RFM segmentation with Python. RFM stands for Recency, Frequency and Monetary. Click here to read more
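
The sketch below is not the article's script. It is a small standard-library illustration of how the three RFM values could be derived per customer. The orders, the customer IDs and the snapshot date are all assumptions.

```python
from datetime import date

# Hypothetical orders: (customer_id, order_date, order_value).
orders = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 1), 60.0),
    ("c2", date(2023, 11, 20), 15.0),
    ("c3", date(2024, 2, 10), 200.0),
    ("c3", date(2024, 2, 25), 120.0),
    ("c3", date(2024, 3, 10), 80.0),
]
snapshot = date(2024, 3, 31)  # assumed analysis date


def rfm_table(orders, snapshot):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, order_date, value in orders:
        days_ago = (snapshot - order_date).days
        recency, frequency, monetary = table.get(customer, (days_ago, 0, 0.0))
        # Recency is the gap to the most recent order, so keep the smallest value.
        table[customer] = (min(recency, days_ago), frequency + 1, monetary + value)
    return table


rfm = rfm_table(orders, snapshot)
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

Segmentation would then bucket customers by these three values, for example by ranking each one into quantiles. The cut-offs depend on the business, so they are left out here.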

Machine learning is a growing field & it seems that almost every company is deploying machine learning algorithms to solve problems. The question is: is machine learning actually necessary, or is it just so companies can say ‘I’m doing that too’? Click here to read more.


In this article, we’ll look at some of the key components of PySpark, which is currently one of the most in-demand big data technologies.

Click here to read more

When completing my domain normalisation project, I used Spark to do the heavy lifting – getting data into a DataFrame & aggregating (group by and sum) – and then used Pandas for the domain manipulation. Finally, I converted my Pandas DataFrame back to Spark to write it to HDFS. Click here to read more


MapReduce is a parallel computing framework that enables us to distribute computing across multiple data nodes. Let’s look at an example. We have a file that contains the text below. Click here to read more
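
The classic illustration is a word count. The sketch below imitates the map, shuffle and reduce phases in plain Python on a single machine. The input text is invented and the function names are hypothetical, so treat it as a model of the idea rather than real Hadoop code.

```python
from collections import defaultdict

# Hypothetical input, split into chunks as if stored on separate data nodes.
chunks = [
    "the cat sat on the mat",
    "the dog sat on the log",
]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group the emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


# On a real cluster each map call would run on a different node in parallel.
mapped = [map_phase(chunk) for chunk in chunks]
word_totals = reduce_phase(shuffle_phase(mapped))
print(word_totals)
```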

Apache Hive provides us with a familiar SQL-like query language to access and analyse the data stored in HDFS. Hive translates our SQL-like queries into MapReduce jobs – so even without programming experience, we can run jobs across many servers and benefit from data parallelisation. Click here to read more


Machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs, we’re able to achieve far greater accuracy in our predictions & constantly refine the model with new data. Click here to read more

Supervised learning is where we provide the model with the actual outputs from the data. This lets it build a picture of the data and form links between the historic parameters (or features) that have influenced the output. To put a formula onto supervised learning, it would be Y = f(X), where Y is the predicted output produced by the model and X is the input data. So, by executing a function against X, we can predict Y. Click here to read more

A decision tree builds a model in the form of a tree structure – almost like a flow chart. To calculate the expected outcome, it uses decision points and, based on the results of those decisions, buckets each input. In this article, we’ll talk about classification and regression decision trees, along with random forests. Click here to read more
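
To show what a single decision point looks like, the sketch below searches for the threshold on one feature that separates two classes most cleanly, using Gini impurity as the measure. The data and function names are invented. A full tree would repeat this search recursively inside each bucket.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the bucket is the same."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(values, labels):
    """Find the threshold on one feature that gives the purest two buckets."""
    best_threshold, best_impurity = None, float("inf")
    ordered = sorted(set(values))
    # Candidate thresholds sit halfway between neighbouring feature values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        # Weight each bucket's impurity by its share of the data.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Invented data: hours of sunshine and whether people went to the beach.
hours = [1, 2, 3, 6, 7, 8]
went = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(hours, went)
print(threshold, impurity)
```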

Regression aims to predict the numeric value of something, given a set of input parameters. For example, we could approximate the price of a car, given its mileage, age, brand, MOT status, etc. In this simple example, we’re going to predict the output value based on three randomly generated input variables. In a real-world example, the variables could be mileage, age and miles since last service. Click here to read more
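
A minimal sketch of that setup follows: three randomly generated input variables and an ordinary least squares fit via NumPy. The "true" intercept and weights are assumptions chosen so that the fit can be checked. They are not values from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Three randomly generated input variables, 50 observations each.
X = rng.uniform(0, 10, size=(50, 3))

# Assumed relationship used to build the target, so the fit can be verified.
true_intercept = 5.0
true_weights = np.array([2.0, -1.0, 0.5])
y = true_intercept + X @ true_weights

# Prepend a column of ones so the intercept is estimated alongside the weights.
design = np.column_stack([np.ones(len(X)), X])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, weights = coefficients[0], coefficients[1:]
print(intercept, weights)

# Predict the output for a new, unseen row of inputs.
new_row = np.array([1.0, 2.0, 3.0])
prediction = intercept + new_row @ weights
print(prediction)
```

Because the target here is generated without noise, the fit recovers the assumed weights almost exactly. Real data would leave residual error.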

In the table above, A is the constant (the Y intercept), also known as B0. X is the X multiplier (the slope), also known as B1. So in our equation Y = B0 + B1(X), we can replace B0 and B1 with the values in the coefficient column of the table.

The standard error column (C & D) tells us how precisely each coefficient has been estimated. The lower the value, the more reliable the estimate and the predictions built on it. Click here to read more
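
For readers who want to see where B0, B1 and their standard errors come from, here is a small standard-library sketch using the textbook least squares formulas. The data points are invented and do not correspond to the table in the article.

```python
import math

# Hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

b1 = sxy / sxx              # slope (the X multiplier)
b0 = y_mean - b1 * x_mean   # intercept (the constant)

# Residual variance uses n - 2 because two coefficients were estimated.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
residual_variance = sum(r ** 2 for r in residuals) / (n - 2)

# Standard errors of the slope and the intercept.
se_b1 = math.sqrt(residual_variance / sxx)
se_b0 = math.sqrt(residual_variance * (1 / n + x_mean ** 2 / sxx))

print(b0, b1, se_b0, se_b1)
```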

Below is a logistic regression model, which uses some dummy data to determine whether people are at risk of diabetes or not – of course, this model couldn’t actually determine whether or not someone has diabetes; it’s just a demonstration. Click here to read more
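
The sketch below is not the article's model. It fits a one-feature logistic regression with plain gradient descent, to show the mechanics. The glucose readings, labels, learning rate and iteration count are all invented for the demonstration.

```python
import math

# Dummy data: (glucose reading, at_risk label). Values are invented.
samples = [(85, 0), (90, 0), (95, 0), (100, 0),
           (140, 1), (150, 1), (160, 1), (170, 1)]

readings = [x for x, _ in samples]
mean = sum(readings) / len(readings)
spread = (sum((x - mean) ** 2 for x in readings) / len(readings)) ** 0.5


def scale(x):
    """Standardise a reading so gradient descent behaves well."""
    return (x - mean) / spread


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


weight, bias = 0.0, 0.0
learning_rate = 0.1  # assumed value, chosen for this toy data

for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, label in samples:
        # Error is predicted probability minus the actual label.
        error = sigmoid(weight * scale(x) + bias) - label
        grad_w += error * scale(x)
        grad_b += error
    weight -= learning_rate * grad_w / len(samples)
    bias -= learning_rate * grad_b / len(samples)


def predict_risk(x):
    """Estimated probability that a reading belongs to the at-risk class."""
    return sigmoid(weight * scale(x) + bias)


print(predict_risk(90), predict_risk(160))
```

The output is a probability rather than a hard yes or no. A threshold such as 0.5 turns it into a class label.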

KMeans clustering searches for clusters of data within a dataset. This is an unsupervised learning model. If we look at plot 1 below, we can easily see the clusters of data – but we haven’t labeled the data (we haven’t told KMeans which cluster each datapoint belongs to). However, as you can see at the bottom of the page, the clusters have been correctly defined. Click here to read more
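
To show how clusters can be found without labels, here is a compact standard-library sketch of the two alternating KMeans steps (assign, then update). The points and the choice of starting centroids are assumptions. Library implementations normally pick the starting centroids more carefully.

```python
import math

# Unlabelled 2-D points forming two visually obvious groups (invented data).
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0),
          (8.0, 8.2), (8.3, 7.9), (7.9, 8.1)]


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Assumption: seed with the first and last points to keep the run deterministic.
labels, centroids = kmeans(points, [points[0], points[-1]])
print(labels)
print(centroids)
```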

We discussed decision trees and random forests in quite a lot of detail here. This article will take you through a practical implementation, where based on historic data, we aim to predict future weather. The data for this model is continuous & hence requires a regression model, rather than a discrete classification model. Click here to read more