Big Data Crash Course: Map Reduce

Map Reduce is a parallel computing framework that lets us distribute computation across multiple data nodes. Let’s look at an example. We have a file that contains the following text:

Line 1: Welcome to the Map Reduce section of our big data post. You’ll learn all about Map Reduce!
Line 2: Once your data nodes have mapped the data, it’ll be summarized in the shuffle node stage.

Our task is to count how many times each word appears in the entire document. For simplicity’s sake, let’s say that our file is split into two pieces and stored on two data nodes by HDFS.

In the first step, each data node individually counts the appearances of each word and creates key-value pairs, as shown in the data nodes listing below.
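As an illustration only (real Map Reduce jobs are usually written in Java against Hadoop’s API; this is a plain Python sketch), the map step for a single line of text might look like this:

```python
def map_words(line):
    """Map step: emit a (word, 1) key-value pair for every word in the line,
    lower-casing each word and stripping trailing punctuation."""
    for word in line.lower().split():
        yield (word.strip(".,!"), 1)

line1 = "Welcome to the Map Reduce section of our big data post. You'll learn all about Map Reduce!"
pairs = list(map_words(line1))
# pairs begins [('welcome', 1), ('to', 1), ('the', 1), ('map', 1), ('reduce', 1), ...]
```

Note that the mapper does no summing at all: a word that appears twice simply produces two separate (word, 1) pairs.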

Then we move to the shuffle stage, where all identical words are routed to the same node. So, if nodes 1 and 2 both contain the word ‘data’, all of those pairs are moved to a single node during the shuffle phase.
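One common way to route pairs is to hash each key, so every occurrence of a word lands on the same node. The `partition` helper below is a hypothetical sketch, not Hadoop’s actual partitioner API:

```python
def partition(pairs, num_nodes=2):
    """Shuffle step: route each (word, count) pair to a node chosen by
    hashing the word. Python's string hash varies between runs but is
    stable within one run, which is all the routing needs."""
    nodes = [[] for _ in range(num_nodes)]
    for word, count in pairs:
        nodes[hash(word) % num_nodes].append((word, count))
    return nodes

nodes = partition([("data", 1), ("the", 1), ("data", 1), ("map", 1)])
# both ('data', 1) pairs are guaranteed to sit in the same node's list
```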

We then have the reduce stage, which adds up all the occurrences and combines them into a single list. You can see the values ‘map’, ‘reduce’, ‘data’ and ‘the’ in the reduce listing, which have been summed across both nodes.
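The reduce step itself is just a sum over each word’s pairs. A minimal Python sketch:

```python
def reduce_counts(pairs):
    """Reduce step: sum the counts for each word into a single total."""
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return totals

reduce_counts([("data", 1), ("data", 1), ("data", 1)])  # returns {'data': 3}
```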

Map Reduce’s three stages are therefore: map, where work is parallelized over the inputs (i.e. multiple processes run across multiple machines at once); shuffle/sort, where the data is routed to the correct nodes; and reduce, where the data is grouped or aggregated into an output.

From the above, we can see that Map Reduce is incredibly useful in some situations. It is, however, not well suited to interactive analysis (where results need to start coming through immediately), as Map Reduce must complete the entire job before any results are shown.

Stage by stage, the job looks like this (the mapper lower-cases words and strips punctuation):

Data node 1 (map output of line 1):
(welcome, 1) (to, 1) (the, 1) (map, 1) (map, 1) (reduce, 1) (reduce, 1) (section, 1) (of, 1) (our, 1) (big, 1) (data, 1) (post, 1) (you’ll, 1) (learn, 1) (all, 1) (about, 1)

Data node 2 (map output of line 2):
(once, 1) (your, 1) (data, 1) (data, 1) (nodes, 1) (have, 1) (mapped, 1) (the, 1) (the, 1) (it’ll, 1) (be, 1) (summarized, 1) (in, 1) (shuffle, 1) (node, 1) (stage, 1)

Shuffle node 1 (its ‘the’ and ‘data’ pairs have moved to node 2):
(welcome, 1) (to, 1) (map, 1) (map, 1) (reduce, 1) (reduce, 1) (section, 1) (of, 1) (our, 1) (big, 1) (post, 1) (you’ll, 1) (learn, 1) (all, 1) (about, 1)

Shuffle node 2 (now holds every ‘the’ and ‘data’ pair):
(once, 1) (your, 1) (data, 1) (data, 1) (data, 1) (nodes, 1) (have, 1) (mapped, 1) (the, 1) (the, 1) (the, 1) (it’ll, 1) (be, 1) (summarized, 1) (in, 1) (shuffle, 1) (node, 1) (stage, 1)

Reduce (output file, as defined in the Map Reduce job):
(welcome, 1) (to, 1) (map, 2) (reduce, 2) (section, 1) (of, 1) (our, 1) (big, 1) (post, 1) (you’ll, 1) (learn, 1) (all, 1) (about, 1) (once, 1) (your, 1) (data, 3) (nodes, 1) (have, 1) (mapped, 1) (the, 3) (it’ll, 1) (be, 1) (summarized, 1) (in, 1) (shuffle, 1) (node, 1) (stage, 1)
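To tie the stages together, here is a minimal single-process sketch (plain Python, purely illustrative; a real cluster runs these stages on separate machines) that reproduces the word counts above:

```python
LINES = [
    "Welcome to the Map Reduce section of our big data post. You'll learn all about Map Reduce!",
    "Once your data nodes have mapped the data, it'll be summarized in the shuffle node stage.",
]

def map_words(line):
    # Map: emit (word, 1) for each word, lower-cased, punctuation stripped.
    for word in line.lower().split():
        yield (word.strip(".,!"), 1)

def shuffle(all_pairs):
    # Shuffle/sort: group the pairs so identical words sit together.
    groups = {}
    for word, count in all_pairs:
        groups.setdefault(word, []).append(count)
    return groups

def reduce_counts(groups):
    # Reduce: sum each word's counts into the final output.
    return {word: sum(counts) for word, counts in groups.items()}

pairs = [pair for line in LINES for pair in map_words(line)]
result = reduce_counts(shuffle(pairs))
# result['data'] == 3, result['the'] == 3, result['map'] == 2, result['reduce'] == 2
```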

Coding a Map Reduce job directly can be quite complex. However, we can interface with Map Reduce through Hive and Impala, which greatly simplifies the Map Reduce development process.

Let’s consider Map Reduce in the wider context. In a global organization with a 500-node Hadoop cluster, we can parallelize data analysis across 500 servers. That means subsets of the data are being crunched at the same time by different machines, leading to far faster output than traditional methods of analysis.