Big Data Crash Course: An Introduction To Kafka

Kafka is a low-latency messaging system that can have many producers and many consumers.

To expand on this, we can take a look at the below diagram. You can see that we have three data sources pushing data into Kafka. This data is then channeled through Kafka's highly available and fault-tolerant structure, where it waits to be pulled by the consumers of the feed.

Kafka producers write to a Kafka topic. For example, our CRM may write to the 'customers' topic while Google Analytics may write to a 'marketing' topic. The consumers of the data can subscribe to one or many topics. So, in the above diagram, Storm may subscribe only to the customers topic while Spark Streaming may subscribe to both the customers and marketing topics.
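To make the topic model concrete, here is a minimal in-memory sketch of the idea (illustrative only — this is not the real Kafka client API; the class and method names are invented for the example). Producers append messages to named topics, and each consumer reads from whichever topics it subscribes to:

```python
from collections import defaultdict

# Toy stand-in for a Kafka broker: topics are just named message lists.
class MiniBroker:
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of messages

    def produce(self, topic, message):
        # A producer appends a message to one topic.
        self.topics[topic].append(message)

    def consume(self, subscribed_topics):
        # A consumer sees every message in each topic it subscribes to.
        return {t: list(self.topics[t]) for t in subscribed_topics}

broker = MiniBroker()
broker.produce("customers", "new signup: alice")       # e.g. the CRM
broker.produce("marketing", "campaign click: summer")  # e.g. Google Analytics

storm_view = broker.consume(["customers"])               # one topic
spark_view = broker.consume(["customers", "marketing"])  # both topics
```

Note that the broker does not care who reads what: subscriptions are entirely the consumer's choice, which is what lets new consumers be added without touching the producers.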

Kafka topics are split into partitions and distributed among Kafka brokers (servers). This enables us to parallelize topics: each partition can be placed on a separate broker, so multiple consumers can read the same topic from multiple machines in parallel.
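A producer decides which partition a message goes to, typically by hashing the message key. The sketch below mimics that idea with a CRC32 hash (the real Kafka client uses murmur2; the function name here is invented for illustration). The important property is that the same key always maps to the same partition:

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this example

def partition_for(key: str) -> int:
    # Hash the key and map it onto one of the topic's partitions.
    # Same key -> same hash -> same partition, every time.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Every message about customer 42 lands in the same partition,
# so a consumer of that partition sees them in write order.
assert partition_for("customer-42") == partition_for("customer-42")
```

Keying messages this way is what makes per-partition ordering useful in practice: all events for one customer stay in one partition, in order.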

We can also parallelize the consumers of a Kafka topic, so that they read from multiple partitions of a topic at the same time, leading to very high message-processing throughput.
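One way to picture this is a group of consumers splitting a topic's partitions between them. The sketch below does a simple round-robin split (similar in spirit to Kafka's built-in assignors; the function and consumer names are invented for illustration):

```python
def assign_partitions(partitions, consumers):
    # Deal partitions out to consumers round-robin, so each consumer
    # gets its own slice of the topic to read in parallel.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions shared across three consumers in one group.
assignment = assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"])
# -> {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Because no two consumers in the group share a partition, they never contend with each other, which is where the throughput gain comes from.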

Let's look at the above diagram for a second. As you can see, each Kafka broker can hold several partitions. The purple blocks are the leaders; the grey blocks are the replicas. Whenever a write happens to a topic, it goes to the leader, which coordinates replication to the replicas. If a leader fails, one of the replicas takes over as leader.
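The failover behaviour can be sketched as follows. This is a deliberate simplification: real Kafka elects the new leader from the set of in-sync replicas via the cluster controller, whereas the toy class below just promotes the first remaining replica (class and broker names are invented for the example):

```python
class Partition:
    def __init__(self, leader, replicas):
        self.leader = leader          # broker currently serving writes
        self.replicas = list(replicas)  # brokers holding copies

    def on_broker_failure(self, broker):
        if broker == self.leader and self.replicas:
            # Promote a surviving replica to leader.
            self.leader = self.replicas.pop(0)
        elif broker in self.replicas:
            self.replicas.remove(broker)

p = Partition(leader="broker-1", replicas=["broker-2", "broker-3"])
p.on_broker_failure("broker-1")
# broker-2 is now the leader; broker-3 remains as a replica.
```

The key point is that the partition stays writable after the failure, as long as at least one replica survives.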

The above highlights that Kafka is highly available, resilient to node failures, and supports automatic recovery, which makes it an excellent option and is a big part of why it has become so popular across the industry.

So, what does Kafka give us? A highly resilient messaging platform with massive throughput potential; guaranteed message ordering within a partition (though not across a whole topic); and a guarantee that messages will not be lost as long as at least one replica is alive.

Let’s summarize Kafka with a table. We will use the same table for Flume and NiFi.

Easy to add more producers: YES
Easy to add more consumers: YES
Fault tolerant: YES
Complexities: Producers and consumers must agree on the protocol, format and schema for data / messages.
Limitations / downfalls: Flume and NiFi (discussed later) have many out-of-the-box connectors to data sources. Kafka does not, which can lead to higher development costs.