Big Data Crash Course: Data Parallelisation

Before we get into what Hadoop is or isn’t good for, we need to discuss parallelisation, as it's a core concept in big data.

We have two types of parallelisation: data and task. Data parallelisation is where we run the same function across multiple nodes (servers), with each node working on a different subset of the dataset, while task parallelisation is where we run many different functions across many nodes, potentially on different datasets, at the same time.

The diagram above shows this concept nicely. In the left image (task parallelisation), we have three nodes: one is running task one, the next is running task two and the final one is handling task three. At the end of the process, all of the results are aggregated together.

In the data parallelisation example (on the right), task one is handled by all of the nodes, each working on a subset of the overall dataset. The output is aggregated at the end.
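
Here's a minimal sketch of data parallelisation using Python's multiprocessing module, with worker processes standing in for nodes (the dataset, chunk count and helper names are made up for illustration): the same function runs on every worker, each over its own slice of the data, and the partial results are aggregated at the end.

```python
from multiprocessing import Pool

# Hypothetical dataset: one weight reading (kg) per person
weights = [72.5, 88.0, 64.2, 95.1, 70.3, 81.7, 59.9, 77.4]

def partial_sum(chunk):
    """The same function runs on every worker, each over its own subset."""
    return sum(chunk), len(chunk)

def split(data, n_chunks):
    """Split the dataset into roughly equal subsets, one per worker."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four workers stand in for four nodes
        results = pool.map(partial_sum, split(weights, 4))

    # Aggregate the per-worker partial results into the final answer
    total, count = map(sum, zip(*results))
    print(f"Average weight: {total / count:.1f} kg")
```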

Task parallelisation works well when we have multiple steps in our algorithm. A simple example would be an algorithm that estimates the average BMI (Body Mass Index) for the population of America. To do this, we first need to calculate the average height (task one) and the average weight (task two). We could distribute these tasks across nodes and they could run in parallel. The results of each task would be aggregated at the end of the process and the final average BMI would be calculated from them, as sketched below.
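
Here's a rough sketch of that idea in code, again using Python processes in place of nodes (the dataset and function names are invented for illustration). Two different tasks run in parallel and their results are combined at the end; note that dividing the average weight by the square of the average height only approximates the true average BMI, which is fine for showing the pattern.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical survey records: (height in metres, weight in kg)
people = [(1.75, 72.5), (1.62, 58.0), (1.88, 95.1), (1.70, 70.3)]

def average_height(records):
    """Task one: average height across the dataset."""
    return sum(h for h, _ in records) / len(records)

def average_weight(records):
    """Task two: average weight across the dataset."""
    return sum(w for _, w in records) / len(records)

if __name__ == "__main__":
    # Each task runs in its own process, standing in for a separate node
    with ProcessPoolExecutor() as pool:
        height_future = pool.submit(average_height, people)
        weight_future = pool.submit(average_weight, people)
        avg_height = height_future.result()
        avg_weight = weight_future.result()

    # Aggregate the task results into the final (approximate) answer
    avg_bmi = avg_weight / (avg_height ** 2)
    print(f"Approximate average BMI: {avg_bmi:.1f}")
```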