NiFi (formerly known as Niagara Files) has brought a new level of flexibility to the Hadoop ecosystem. Like Kafka and Flume, it’s used to move data from one system to another, but unlike those platforms, it does so in a very user-friendly manner.
The NiFi user interface lets us build flows by drag and drop, which makes for rapid ETL development. This makes it easy to add consumers, basic transformations and data prioritisation rules to our ingestion process.
In terms of prioritisation, NiFi gives us the flexibility to define our own rules. These could be ‘First In First Out (FIFO)’ or ‘Last In First Out (LIFO)’, or we could choose to process the largest or smallest files ahead of the others.
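To make the difference between these rules concrete, here is a minimal Python sketch of how each one would order the same queue of files. This is purely illustrative and is not NiFi's API (in NiFi, prioritisation is configured in the user interface); the names `QueuedFile`, `PRIORITY_RULES` and `processing_order` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedFile:
    """Hypothetical stand-in for a file waiting in a queue."""
    name: str
    size_bytes: int
    arrival: int  # sequence number; lower means it arrived earlier


# Each rule maps a file to a sort key; the file with the smallest key goes first.
PRIORITY_RULES = {
    "fifo": lambda f: f.arrival,
    "lifo": lambda f: -f.arrival,
    "largest_first": lambda f: -f.size_bytes,
    "smallest_first": lambda f: f.size_bytes,
}


def processing_order(files, rule):
    """Return file names in the order the given rule would process them."""
    return [f.name for f in sorted(files, key=PRIORITY_RULES[rule])]


queue = [
    QueuedFile("a.csv", 500, arrival=1),
    QueuedFile("b.csv", 20, arrival=2),
    QueuedFile("c.csv", 9000, arrival=3),
]

for rule in PRIORITY_RULES:
    print(rule, processing_order(queue, rule))
```

The same three files come out in a different order under each rule, which is the whole point: the choice of rule decides which data reaches the consumer first when the queue backs up.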
The user interface gives us the ability to visualise our data flows, which helps with identifying bottlenecks or unnecessary waste. It’s a nice touch. However, the real game changer with NiFi is the ability to make changes to a data flow without stopping the process: as soon as you commit your change, it takes effect.
Within the interface, we have access to a large library of NiFi connectors (currently around 200), which enable us to connect to many producers and consumers very quickly, further reinforcing the rapid development mindset that drives NiFi.
Another key feature of NiFi is its data provenance functionality: we can trace every data point back to its source.
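As a rough illustration of what tracing a data point back to its source means, the Python sketch below walks a chain of recorded events from a delivered item back to where it entered the flow. It is a simplified model and not how NiFi stores provenance internally; `ProvenanceEvent`, `trace_to_source`, the event labels and the component names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProvenanceEvent:
    """Hypothetical record of one thing that happened to a piece of data."""
    data_id: str
    event_type: str  # illustrative labels such as "RECEIVE", "TRANSFORM", "SEND"
    component: str
    parent_id: Optional[str] = None  # the data this item was derived from, if any


def trace_to_source(data_id, events):
    """Follow parent links from data_id back to the origin, returned source-first."""
    by_id = {e.data_id: e for e in events}
    lineage = []
    seen = set()  # guards against an accidental cycle in the parent links
    current = by_id.get(data_id)
    while current is not None and current.data_id not in seen:
        seen.add(current.data_id)
        lineage.append(current)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(lineage))


events = [
    ProvenanceEvent("raw-1", "RECEIVE", "source SFTP"),
    ProvenanceEvent("clean-1", "TRANSFORM", "clean-up step", parent_id="raw-1"),
    ProvenanceEvent("out-1", "SEND", "destination SFTP", parent_id="clean-1"),
]

for event in trace_to_source("out-1", events):
    print(event.event_type, event.component)
```

Starting from the delivered item, the trace lists every step it went through, beginning with the component that first received it.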
Now, let’s talk about security. NiFi supports two-way SSL encryption of the dataflow. Additionally, if the user enters a password or other sensitive information into the flow, NiFi will immediately encrypt it server-side and will never present it client-side again (even in its encrypted form).
NiFi also has a solid management and monitoring interface, which enables us to identify issues without diving into the log files. This feature can drastically reduce troubleshooting time and makes the cluster far easier to manage.
NiFi provides us with the ability to scale up (vertical scalability) and scale out (horizontal scalability). This is one of the features that makes NiFi well suited to streaming data with potentially unbounded velocity and volume.
NiFi guarantees delivery of all messages. It pulls data from producers and pushes to consumers.
As with Flafka, we can combine NiFi with Kafka to mitigate the weak points of both solutions. The problem we often face with NiFi is that when the cluster grows and starts consuming data from many sources, all sorts of load balancing and failover issues appear. So, we can utilize an asynchronous queueing system such as Kafka to simplify the ingestion from NiFi.
Let’s look at two NiFi-Kafka scenarios and, with them, introduce MiNiFi. This is a sub-project of Apache NiFi that enables us to collect data at its source, which gives us guaranteed delivery from the source to the main NiFi node.
So, in the diagram below, MiNiFi would bring data from the sources to the central NiFi instance, which then passes the data onto a Kafka topic. Why is this useful? Well, without writing any code, we’ve been able to connect to the data sources and visually monitor and manage the pipeline before passing it into Kafka. This has reduced the Kafka development overhead significantly.
Another scenario in which we would couple NiFi and Kafka is where an organization already has a Kafka pipeline set up. We may choose to use NiFi as a consumer to take data from Kafka to where it needs to go. Again, this enables us to add more data consumers without writing any code. We could also utilize the batching capabilities in NiFi to batch data together to send off to HDFS.
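Batching matters here because HDFS handles a small number of large files far better than a large number of tiny ones. The idea can be sketched in a few lines of Python: group incoming records into fixed-size batches and write each batch as one unit. This is only a model of the concept, not NiFi's implementation, and `batch_records` is a hypothetical name.

```python
def batch_records(records, max_batch_size):
    """Yield lists of up to max_batch_size records, preserving arrival order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left over so no record is dropped.
        yield batch


# Seven messages grouped into batches of three.
messages = [f"msg-{i}" for i in range(7)]
for batch in batch_records(messages, 3):
    print(len(batch), batch)
```

In practice a batching step would usually also flush on a time limit, so that a slow trickle of data is not held back indefinitely waiting for a batch to fill; that is left out here to keep the sketch short.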
Let’s take a little look at NiFi. Below, I’ve set up a simple flow that picks data up from the source SFTP server and moves it to the destination SFTP server. When a new file is queued to go to the destination SFTP server, an email is sent to a recipient. This is a very simplistic example; however, it shows the simple NiFi user interface and the out-of-the-box SFTP connectors.
Below is a sample of the email sent upon receipt of new data.