Big Data Crash Course: Combining Ingestion Tools


Earlier in this series, we discussed the benefits of Flafka (Flume and Kafka) and of NiFi coupled with Kafka. Here, we will review those configurations and also look at other ways we may choose to ingest data into our platform by combining multiple components.

Below (in purple) is what I have named the ETL / ELT toolbox. Why? Well, imagine that you’re fixing a problem in your home. You reach for your toolbox. For some problems, you may only require a single tool while for others you may require several tools to get the job done. Big data is no different – sometimes our requirements are simple and can be handled by a single tool. Other times, we may find that we need to combine many tools in order to reach the desired outcome.

As you can see below, I have included a three-stage ETL / ELT pipeline, although this will vary based on your use case. In some instances you may need to jump straight from stage 1 to storage. In other cases, you may need to include several steps before dropping the output into HDFS.
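The idea of a pipeline whose number of stages varies by use case can be sketched in a few lines of Python. This is only an illustration, not how any of the tools in this series are configured: `run_pipeline`, `drop_blank` and `normalise` are hypothetical names invented for the example, and the "storage" step is simply the returned list.

```python
from typing import Callable, Iterable, List

# A stage takes a stream of records and returns a (possibly changed) stream.
Stage = Callable[[Iterable[str]], Iterable[str]]


def run_pipeline(records: Iterable[str], stages: List[Stage]) -> List[str]:
    """Pass records through each stage in order; an empty list skips straight to storage."""
    for stage in stages:
        records = stage(records)
    return list(records)  # stands in for the final write to storage


# Hypothetical stages, named only for illustration.
def drop_blank(records: Iterable[str]) -> Iterable[str]:
    return (r for r in records if r.strip())


def normalise(records: Iterable[str]) -> Iterable[str]:
    return (r.strip().lower() for r in records)


raw = ["  Alpha ", "", "BETA"]
direct = run_pipeline(raw, [])                        # stage 1 straight to storage
cleaned = run_pipeline(raw, [drop_blank, normalise])  # extra steps before storage
```

Keeping each stage as an independent function mirrors the toolbox idea: stages can be added, removed or reordered without touching the others.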

The table below outlines some of the benefits of combining various Hadoop components:

Flume + Kafka: Utilize Flume’s out-of-the-box connectors to connect to data sources with minimal development overhead. This data is then passed to a Kafka topic and replicated across a number of Kafka brokers, providing resilience and delivery guarantees. We then utilize Kafka’s ability to serve many consumers, whereas Flume is best suited to writing data into HDFS.

Flume + Kafka + Flume: In this scenario, we have all of the above, but we also take advantage of Flume’s optimized connectors to HDFS and HBase, providing optimal write speeds.

NiFi + Kafka: Utilize MiNiFi to guarantee delivery from the point of collection at the source. NiFi enables us to collect this data and monitor / manage the data pipeline before passing it to Kafka. This reduces the Kafka development overhead significantly.

Kafka + NiFi: Utilize Kafka to simplify ingestion into NiFi and use NiFi’s out-of-the-box connectors to make delivery to consumers simpler.
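To make the "many consumers" benefit concrete, here is a toy, in-memory model of a topic that several consumer groups read independently. It is a sketch under stated assumptions: `ToyTopic` is a hypothetical class written for this post and is not the Kafka client API; it ignores partitions, replication and persistence, and only shows that each group tracks its own position in a shared log.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (not the Kafka API)."""

    def __init__(self):
        self._log = []                    # append-only record log
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, record):
        self._log.append(record)

    def poll(self, group):
        """Return the records this group has not seen yet and advance its offset."""
        start = self._offsets[group]
        records = self._log[start:]
        self._offsets[group] = len(self._log)
        return records


topic = ToyTopic()
for line in ["evt-1", "evt-2", "evt-3"]:  # stands in for a Flume or NiFi source
    topic.publish(line)

hdfs_batch = topic.poll("hdfs-writer")    # e.g. a Flume agent writing to HDFS
hbase_batch = topic.poll("hbase-writer")  # a second, independent consumer
```

Both groups receive all three records because reading does not remove anything from the log. That is the property that lets one ingestion path feed several downstream writers.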


There are, of course, combinations that include Spark Streaming or Storm. However, the use case there is clear: Storm and Spark Streaming provide the heavy lifting and ETL for the ingested data. This is functionality not provided by NiFi, Kafka or Flume.