Big Data Crash Course: MySQL


This section won’t focus too heavily on MySQL itself. It’s a very well documented, traditional relational database system. The focus here will be why it belongs in a big data ecosystem at all.

To answer that question, we can use the below diagram, which shows a few different methods we can use to take data from its source and land it into storage. The purple line of the diagram shows the path that data destined for MySQL might take. The grey paths show the route that other data may take through the system.

What the diagram intends to show is that there is more than one way to skin a cat. The ETL / ELT method you choose will depend on the types of data you have – you may choose to transform the data, or you may not – and similarly, the storage medium for that data is going to depend on your specific use case.

In the above diagram, I’ve shown three methods by which we may get data into MySQL. The first is through a simple ETL process, during which we might drop fields, aggregate data and so on. The other two are highlighted with thick purple arrows, and these are the scenarios on which I’d like to focus.
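To make the first method concrete, here is a minimal sketch of a simple ETL step like the one described above: extract raw records, drop a field we don’t need, aggregate per customer, then load the result. The field names (`customer_id`, `order_total`, `referrer`) and the in-memory "warehouse" are hypothetical stand-ins, not part of any real pipeline.

```python
# Minimal ETL sketch: extract -> transform (drop fields, aggregate) -> load.
# Field names and the in-memory target are hypothetical stand-ins.
from collections import defaultdict

def extract():
    # Stand-in for reading raw rows from a source system
    return [
        {"customer_id": 1, "order_total": 20.0, "referrer": "email"},
        {"customer_id": 1, "order_total": 35.0, "referrer": "ads"},
        {"customer_id": 2, "order_total": 10.0, "referrer": "email"},
    ]

def transform(rows):
    # Drop the referrer field and aggregate order totals per customer
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["order_total"]
    return [{"customer_id": cid, "total_spend": t} for cid, t in totals.items()]

def load(rows, target):
    # Stand-in for writing the transformed rows into storage
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
```

In a real pipeline the extract and load steps would talk to actual source and target systems, but the shape of the job – read, reduce, write – stays the same.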

Let’s say that you own an online store and you’ve got tonnes of data being pumped into your Hadoop cluster. That data is derived from your CRM (Customer Relationship Management) system, Google Analytics, customer click streams, email marketing platforms, customer demographics, order management systems and accounting systems. So your architecture looks a bit like the below diagram.

In the diagram, you can see that we’re taking lots of data from our sources, doing some ETL / ELT on the way into the cluster and dropping the data into HDFS. From there, our Spark job picks the data up, derives a ‘likelihood to purchase’ score and exports ONLY those customers with a score over 9.

As a result of the Spark job, our data size has been reduced from 1 billion rows down to a much more manageable 500 rows. Why? Because only 500 customers meet the criteria set in our Spark job. These are the customers that we believe are very likely to buy our product.
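The scoring-and-filter step above can be sketched as follows. In production this would be a Spark job running over HDFS data; plain Python is used here so the example is self-contained, and `score_customer()` is a hypothetical model on a 0–10 scale, not the real scoring logic.

```python
# Sketch of the scoring-and-filter step. score_customer() is a
# hypothetical model; a real job would run this logic in Spark.
def score_customer(customer):
    # Hypothetical weighting of recent visits and past spend, capped at 10
    return min(10.0, 0.5 * customer["recent_visits"] + 0.002 * customer["total_spend"])

customers = [
    {"id": 1, "recent_visits": 25, "total_spend": 400.0},
    {"id": 2, "recent_visits": 2, "total_spend": 100.0},
]

# Export ONLY those customers with a score over 9
hot_leads = [c for c in customers if score_customer(c) > 9]
```

The key point is the filter: however the score is derived, only the small set of high-scoring customers ever leaves the cluster, which is why a billion rows can collapse to a few hundred.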

Our Spark job dumps the results into a MySQL table, which is in turn surfaced through the web interface so that our sales people can call the potential clients. We’ve used MySQL for several reasons. Firstly, while the dataset we’re exporting may be a single table, we’re exporting it into a wider database structure, which looks like this:

By exporting our customer score into the purple table above, we’re able to integrate our predictive algorithm output with the rest of our existing web application.
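To illustrate that integration, here is a sketch of how the exported scores can sit alongside the existing application tables and be joined back onto customer records. The table and column names are hypothetical (the actual schema is the one in the diagram), and SQLite stands in for MySQL so the example is self-contained.

```python
# Sketch of the exported scores table sitting alongside an existing
# customers table. Names are hypothetical; SQLite stands in for MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE purchase_scores (
        customer_id INTEGER REFERENCES customers(id),
        score REAL,
        scored_at TEXT
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'alice@example.com')")
conn.execute("INSERT INTO purchase_scores VALUES (1, 9.7, '2020-01-01')")

# The web interface joins the scores back onto existing customer records
rows = conn.execute("""
    SELECT c.name, c.email, s.score
    FROM purchase_scores s
    JOIN customers c ON c.id = s.customer_id
    WHERE s.score > 9
""").fetchall()
```

Because the scores land in an ordinary relational table, the existing web application can query them with a plain join – no new infrastructure required on the serving side.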

MySQL enables fast querying in a relatively scalable environment. It works very well for web applications and can comfortably handle database sizes of up to a couple of terabytes. For this use case it was the right choice, as it integrated with the existing application and had minimal impact on the existing environment.