Big Data Crash Course: An Introduction To Flume / Flafka

Flume is similar to Kafka in many ways. It takes data from a source and delivers it to a destination. The key difference is that Flume is designed for high-throughput log streaming into Hadoop; it is not designed to ship data out to a large number of consumers.

Flume, like Kafka, is distributed, reliable and highly available; however, its approach to high availability is not quite as slick as Kafka's.

Let’s say, for example, that we have an HTTP data source flowing into Flume. We want to make the pipeline highly available and ensure that, if a Flume agent were to go down, we would still have access to the data.

The only way to do this in Flume would be to create two source machines that the HTTP source writes to in parallel. You'd then need to tag the data as it flows through the Flume pipeline so that it can be de-duplicated at the sink.

As you can imagine, the de-duplication phase of the Flume cycle adds a fair amount of otherwise unnecessary overhead.
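As a rough illustration of the tagging step, the sketch below shows a custom Flume interceptor that stamps each event with a deterministic id derived from its body. The class name and the "eventId" header key are illustrative choices, not part of Flume itself; both parallel agents would apply the same interceptor so that the sink side can drop duplicates.

```java
import java.util.List;
import java.util.UUID;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Illustrative interceptor that tags each event with a deterministic id
// so that records arriving via either of the parallel agents carry the
// same tag and can be de-duplicated downstream.
public class DedupeTagInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // no state to set up
  }

  @Override
  public Event intercept(Event event) {
    // Derive the id from the event body, so the same record gets the
    // same id regardless of which agent it passed through.
    String eventId = UUID.nameUUIDFromBytes(event.getBody()).toString();
    event.getHeaders().put("eventId", eventId);
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() {
    // nothing to clean up
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new DedupeTagInterceptor();
    }

    @Override
    public void configure(Context context) {
      // no configuration needed for this sketch
    }
  }
}
```

In a real deployment the interceptor would be registered against the HTTP source in each agent's configuration file, and the sink side would use the "eventId" header to discard duplicates.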

However, Flume more than redeems itself in ease of use. It ships with plenty of tried-and-tested, out-of-the-box connectors, removing much of the development overhead that comes with Kafka.

There is a concept called ‘Flafka’ through which we exploit the strengths of both Kafka and Flume to create a more robust ingestion process.

By deploying Flafka, we take advantage of the out-of-the-box connectors available in Flume and use them to connect to all of our data sources. The data is then passed to a Kafka topic. As we discussed earlier, Kafka replicates the topic partitions across a number of Kafka brokers (servers) and then distributes the data to many consumers. As discussed above, Flume is best at moving data to somewhere within Hadoop (HDFS and HBase), so Kafka adds flexibility by serving many more consumers.

Once Flume passes the message to Kafka, we can guarantee that it won’t be lost. So, Flafka utilizes the flexibility of Flume and the industrialised resilience of Kafka.
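To make the consumer side concrete, the sketch below shows a plain Java Kafka consumer reading the topic that the Flume agent publishes to. The broker addresses, the consumer group and the "weblogs" topic name are placeholders for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogTopicConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers
    props.put("group.id", "hdfs-loader");                        // placeholder consumer group
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      // "weblogs" stands in for whatever topic the Flume agent writes to.
      consumer.subscribe(Collections.singletonList("weblogs"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          // Each consumer group receives its own copy of the stream,
          // which is what makes fan-out to many destinations cheap.
          System.out.printf("partition=%d offset=%d value=%s%n",
              record.partition(), record.offset(), record.value());
        }
      }
    }
  }
}
```

Because each consumer group receives its own copy of the stream, adding another downstream destination is simply a matter of starting another consumer group against the same topic.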

We can use Flume at one end of the ingestion process, or at both. In the above diagram, Flume is used to connect to the data sources, and consumers then pull data from Kafka. However, as below, we could use Flume on both ends of the process, taking advantage not only of its connectors to data sources but also of its optimized connectors to HDFS and HBase.

Your use case will determine which of the tools you use and in what combination. The table below provides a summary of the key Flume characteristics. Many of its limitations can be worked around by combining it with Kafka.

Easy to add more producers: YES – lots of out-of-the-box connectors
Easy to add more consumers: NO – may require significant re-work
Fault tolerant: NO – but can be (with some difficulty)
Scalable: Horizontally
Complexities: Only ideal for streaming into Hadoop
Limitations / downfalls: Complexities in building in fault tolerance; built primarily for moving data to somewhere within Hadoop