Big Data Crash Course: Apache Storm

Apache Storm provides us with a distributed, real-time computation platform. It has been designed to reliably process large streams of data at high velocity. To put this into something we can all appreciate, Storm has been benchmarked processing one million tuples per second per node.

A key benefit of the Storm framework is that you can develop against it in almost any programming language, making it a very accessible solution.

Storm is fault tolerant. If a worker instance dies, Storm will automatically restart it; this is referred to as ‘failing fast and auto-restarting’. Storm also guarantees that data will be processed, with several delivery options: ‘at most once’, ‘at least once’ or ‘exactly once’. Messages are only replayed through Storm when there are failures that need to be resolved.
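To make the at-least-once guarantee concrete, here is a minimal bolt sketch using Storm’s core Java API (Storm 2.x method signatures assumed; the class name and the ‘word’ field are illustrative, not from any particular topology). Emits are anchored to the incoming tuple, and acking or failing that tuple tells Storm whether it can be retired or must be replayed from the spout.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative bolt: the class name and the "word" field are hypothetical.
public class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String word = input.getStringByField("word");
            // Anchoring the emit to 'input' links the new tuple into the tuple tree.
            collector.emit(input, new Values(word.toUpperCase()));
            collector.ack(input);   // fully processed: Storm can retire the tuple
        } catch (Exception e) {
            collector.fail(input);  // failed: Storm asks the spout to replay it
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```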

Storm has the concept of spouts and bolts. Spouts are the sources that feed streams of data into the framework and bolts handle the processing and output – that is to say, spouts bring data into the platform and bolts do the main computational work.

Bolts can perform filtering, functions, aggregations, joins and much more. You may be wondering why you would daisy-chain multiple bolts together. Each bolt does a simple transformation; more complex transformations may require multiple steps and hence multiple bolts.

For example, let’s say that we’re looking to display images of the CEOs of the top 100 companies. The first bolt would use the companies’ financial data to determine which 100 companies make the cut. The second bolt would then stream out the images of their CEOs.
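As a rough sketch of how such a two-bolt topology might be wired together (the spout and bolt classes here – FinancialsSpout, Top100Bolt and CeoImageBolt – are hypothetical, and Storm 2.x local mode is assumed):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Hypothetical topology: two bolts are daisy-chained behind a single spout.
public class Top100CeoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout bringing company financial data into the platform.
        builder.setSpout("financials-spout", new FinancialsSpout(), 2);

        // First bolt: rank companies by the financial data and keep the top 100.
        builder.setBolt("top-100-bolt", new Top100Bolt(), 4)
               .fieldsGrouping("financials-spout", new Fields("company"));

        // Second bolt: look up and stream out each CEO's image.
        builder.setBolt("ceo-image-bolt", new CeoImageBolt(), 4)
               .shuffleGrouping("top-100-bolt");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("top-100-ceos", new Config(), builder.createTopology());
            Thread.sleep(60_000);  // let the topology run for a minute in local mode
        }
    }
}
```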

One of the benefits of Storm that enables us to get up and running quickly is that it comes with lots of out-of-the-box spouts for us to use. If a spout for your data source doesn’t already exist you can develop one, but where one is already available it can reduce development time significantly.
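For example, here is a minimal sketch of plugging in the ready-made Kafka spout from Storm’s storm-kafka-client module; the broker address, topic name and the downstream ProcessBolt are placeholders.

```java
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

// Sketch only: wiring Storm's out-of-the-box Kafka spout into a topology.
public class KafkaIngestTopology {
    public static void main(String[] args) {
        // Consumer properties (group id, deserializers, etc.) can be set on the builder.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka-broker:9092", "events-topic").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        builder.setBolt("process-bolt", new ProcessBolt(), 4)   // ProcessBolt is hypothetical
               .shuffleGrouping("kafka-spout");
        // Submit with StormSubmitter (cluster) or LocalCluster (local mode) as usual.
    }
}
```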

We would use Storm for the delivery of real-time analytics, machine learning, customized customer recommendations (like those on Amazon) and ETL, to name just a few use cases.

If you’re already aware of Spark Streaming, you’re probably wondering when we would choose to use Storm and when we would choose to use Spark. If you’re not, we cover Spark Streaming in the next section.

| | Storm | Spark Streaming |
|---|---|---|
| Broad language support required | YES | – |
| Stream processing | YES | YES |
| Exactly-once delivery requirement | REQUIRES TRIDENT | YES |
| Processing model | Event (micro-batching when using Trident) | Micro-batching |
| Latency / speed | Milliseconds (much faster for smaller data sets) | Seconds |
| Throughput | – | Significantly more than Storm in the same time window |
| Resilience | YES | YES |
| Working with Kafka | – | Lower CPU use and hence higher Kafka throughput potential |
| Integrations | With other Storm libraries / components | With other Spark libraries (e.g. machine learning) |

In conclusion, there is no cut-and-dried answer as to which of the two you should use. Spark Streaming and Storm both have their benefits, and performance figures vary greatly across different benchmark experiments. As such, testing the two side by side in your environment and on your own dataset is the best thing to do. That said, if your Kafka brokers are CPU constrained, or you need to work with the other Spark libraries (such as machine learning), then Spark would be the right choice.

Both are stateful. This means that if you lose a worker or driver node you can recover, and whatever data came in during the downtime will be replayed.