Big Data Crash Course: Spark Streaming


Apache Spark is a framework that runs on the Hadoop ecosystem. It comprises several components, including SparkSQL, Spark Streaming and the Spark machine learning library (MLlib). We will go through the framework as a whole in this section, as without doing so we will not be able to fully appreciate the benefits Spark offers.

Spark brings a new way of handling big jobs on Hadoop. Traditionally, we would use MapReduce jobs. However, these can be very slow due to heavy disk I/O. In fact, the average MapReduce job shuffles data into and out of memory at least three times, and each Hive query is on average made up of three to five MapReduce jobs. That can mean up to fifteen shuffles onto and off of the disk for a single query. Spark removes (or reduces) this issue by handling the job in memory where possible; we can configure whether it uses memory only, memory plus disk, or disk only.

In terms of speed, this can make Spark up to 100x faster than MapReduce when running in memory and around 10x faster when utilizing disk.

The above diagram shows the Spark framework in action. You can see in the centre of the diagram (during the ingestion phase), we have SparkSQL, Spark Streaming and MLLib. To the right, we have SparkSQL again during the analysis phase.

So what exactly is going on here? We will start by explaining the central section: Spark Streaming.

Just like Apache Storm, Spark Streaming provides us with a distributed, real-time computation platform. It has been designed to reliably process large, high-velocity streams of data.

At a very simple level, we can say that Spark Streaming enables us to ingest data, transform it in some way and distribute it to its destination. The way it does this is a little different to Storm.

Spark utilizes the idea of micro batches based on a time interval. For example, we could set the interval to one second: any data that arrives during that period will be batched up and processed by Spark as a single micro batch. This does add some latency to the process, of course. However, in order to deliver the 'exactly-once' processing guarantee we may require, both Spark and Storm Trident utilize micro batches. We can set the batch interval as low as half a second.
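To make the micro-batch idea concrete, here is a minimal sketch in plain Python of grouping arriving records into time-based batches. This is an illustration of the concept only; in real Spark Streaming you would set the batch interval on the `StreamingContext` rather than batching by hand.

```python
# Toy simulation of time-based micro-batching: events carry a timestamp,
# and every event falling inside the same interval ends up in one batch.

def micro_batch(events, interval=1.0):
    """Group (timestamp, record) pairs into batches of `interval` seconds."""
    batches = {}
    for ts, record in events:
        batch_id = int(ts // interval)          # which window the event falls into
        batches.setdefault(batch_id, []).append(record)
    # emit batches in time order, as Spark processes them
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d")]
print(micro_batch(events, interval=1.0))   # → [['a', 'b'], ['c'], ['d']]
```

With a one-second interval, "a" and "b" arrive in the same window and are processed together, which is exactly the latency-for-reliability trade-off described above.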

Spark is built around RDDs (Resilient Distributed Datasets). These are distributed collections of data that record the lineage of transformations used to build them, so they can be recomputed (replayed) when needed. This is at the core of the stateful processing provided by Spark. As you may remember from above, being stateful means that if you lose a worker or driver node, you can recover: whatever data came in during the downtime period will be replayed.
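The replay idea can be sketched as follows. This toy class is illustrative only (not the real Spark API): each dataset remembers how it was derived from its parent, so the result can be recomputed from the original source at any time rather than restored from a copy.

```python
# Toy sketch of RDD-style lineage: transformations are recorded lazily,
# and compute() replays the whole chain from the source. This mirrors how
# Spark recovers a lost partition by recomputing it from lineage.

class ToyRDD:
    def __init__(self, source_fn, parent=None):
        self.source_fn = source_fn   # how to (re)compute this dataset
        self.parent = parent         # the dataset it was derived from

    def map(self, fn):
        # record the transformation in the lineage instead of running it now
        return ToyRDD(lambda data: [fn(x) for x in data], parent=self)

    def compute(self):
        # replay the full lineage from the original source
        if self.parent is None:
            return self.source_fn(None)
        return self.source_fn(self.parent.compute())

base = ToyRDD(lambda _: [1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.compute())   # → [2, 4, 6], recomputable at any time
```

Because nothing is stored except the recipe, losing the computed result costs nothing: calling `compute()` again replays the lineage and produces the same data.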

So we can say that in many respects, Spark Streaming operates very similarly to Storm (or at least achieves the same result). The bit that really sets it apart is its ability to interact with the other elements of the Spark framework. Let's look again at the above diagram: why does SparkSQL interact with the streaming block?

SparkSQL enables us to query data as it passes through the streaming pipeline. We don't need to wait for it to land in HDFS; we can gain real-time insight into the data as it is being processed. We call this interactive querying.
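As an illustration of querying an in-flight batch with SQL, the sketch below uses Python's built-in sqlite3 as a stand-in query engine. In Spark itself you would register the micro batch as a temporary table and query it with SparkSQL; the table name and columns here are purely hypothetical.

```python
# Illustration only: run a SQL aggregation over one in-flight micro batch,
# before anything is written to HDFS. sqlite3 stands in for SparkSQL here.

import sqlite3

def query_batch(batch, sql):
    """Load one micro batch into an in-memory table and query it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, amount REAL)")  # hypothetical schema
    conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
    return conn.execute(sql).fetchall()

batch = [("alice", 10.0), ("bob", 99.5), ("alice", 3.0)]
print(query_batch(batch, "SELECT user, SUM(amount) FROM events GROUP BY user"))
```

The query runs against data that only exists in memory as part of the current batch, which is the essence of interactive querying on a stream.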

Now, we can also see that Spark Streaming interacts with the machine learning libraries. The diagram below will help us understand what happens during this part of the process.

In the above, we can see that a data source feeds into a model. Initially, this will be our 'training' data: historical data we can use to help the model learn the likely outcomes of a particular scenario. From here, you can imagine that the model plots the data on a chart and adds a line of best fit. It can then estimate the likely Y value for a given input of X.

What happens now is that we continually pump data into the model, further helping it to refine what it's doing. The model then passes its 'prediction' to the learner. The learner has access to both the actual results ('the truth') and the model's prediction. It is then able to compare the two and adjust the parameters of the model. The cycle repeats over and over until we reach a model accuracy level that we're happy with.
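The predict, compare, adjust cycle described above can be sketched with a one-parameter online linear regression (y ≈ w · x). This is a toy stand-in for what MLlib's learners do, not the actual algorithm.

```python
# Toy learner loop: the model predicts, the learner compares the prediction
# with the truth, and nudges the model's parameter. Repeating this drives
# the parameter towards the underlying relationship in the data.

def train(stream, lr=0.05, epochs=50):
    w = 0.0                              # the model's single parameter
    for _ in range(epochs):
        for x, truth in stream:          # new data keeps arriving
            prediction = w * x           # model makes a prediction
            error = prediction - truth   # learner compares with the truth
            w -= lr * error * x          # and adjusts the parameter
    return w

stream = [(1, 2.0), (2, 4.0), (3, 6.0)]  # underlying rule: y = 2x
w = train(stream)
print(round(w, 2))   # converges close to 2.0
```

Each pass shrinks the gap between prediction and truth, which is precisely the refinement loop the diagram depicts.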

We can look at it in a slightly different, more conceptual way in the diagram below. In the top row, we have training data feeding the model. Once validated, that model is used on 'real' data to make predictions. The real-life 'actual' results are then passed back into the training data pool to further refine the model.

So why is interacting with the various elements of the Spark framework so helpful? Well, it means that the codebase maintenance is far easier. We can create one codebase and reuse it across many different scenarios.

The final point on Spark gives a little insight into what the 'driver' and 'worker' nodes mentioned to this point actually are.

The above diagram shows the setup of a Spark job. We have the driver node, which effectively holds all the code we've written (our algorithms), and we have the worker nodes, which do all the heavy lifting.

This setup enables us to adopt a parallel processing model with many workers processing our data at the same time.