Big data series: distributed file systems & Hadoop ecosystem

TechTerms.com defines a distributed file system as follows: “A DFS manages files and folders across multiple computers. It serves the same purpose as a traditional file system, but is designed to provide file storage and controlled access to files over local and wide area networks. Even when files are stored on multiple computers, DFS can organize and display the files as if they are stored on one computer”.

So what does that mean? Let’s take a look.

From the above, you can see three ‘blocks’ of data, each replicated across multiple racks. This creates fault tolerance: if one rack were to go down, the data would still be available. It also helps us scale access to the data across many concurrent users.

This setup does bring about some problems. For example, it becomes hard to manage changes to files & ensure that all locations are updated. For this reason, many platforms adopt a ‘write once’ mentality & save any changes as additional data sets.

So, in the past, we had to use parallel computing clusters built from specialist hardware, costing millions of dollars to create & maintain. Now though, we can utilize commodity clusters, that is, clusters built from affordable, less specialized compute resources. This reduces costs significantly.


In the above, we have three networks: the blue network, the red network and the green network. Together, these comprise our WAN (Wide Area Network). We are able to utilize distributed computing over a LAN (Local Area Network) or WAN environment, further adding to the resilience & fault tolerance of the solution.

We consider data parallelism to be the notion of job sharing, where particular nodes (servers) can work on different datasets or on parts of the same dataset. This leads to greater scalability, better performance and lower costs.
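
As a rough single-machine analogy (just a sketch, not how a real cluster works), Python’s multiprocessing module can illustrate the same idea: split a dataset into chunks & hand each chunk to a separate worker process, then combine the partial results:

```python
from multiprocessing import Pool

def count_even(chunk):
    # Each worker counts the even records in its own slice of the data.
    return sum(1 for record in chunk if record % 2 == 0)

if __name__ == "__main__":
    dataset = list(range(1_000_000))
    # Partition the dataset into four chunks, one per worker.
    chunks = [dataset[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_even, chunks)  # the data-parallel step
    print(sum(partial_counts))  # combine the partial results -> 500000
```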

What about programming models for big data?

Programming models comprise runtime libraries and programming languages. When we look for a big data programming model, we need one that can:

  • Split large volumes of data to distribute analysis across multiple nodes and synchronise the output later
  • Access data fast
  • Distribute computation to nodes (taking the compute resource to the data)
  • Schedule parallel tasks
  • Replicate data partitions
  • Recover files when required
  • Enable adding more nodes and racks (scaling the cluster out)

One such programming model is MapReduce. Let’s look at an easy example of how this works. Let’s say that we have a pile of playing cards. Some of the cards have a picture of a bird, some have a picture of a pig and the rest have a picture of a horse.

So, you give three people a random selection of cards & ask them to sort those cards into piles, count up the total number of each animal & then place them in a centralized bucket.

So, here are our three participants & their total counts:

So there is our example of MapReduce: our individuals worked separately, conducted separate calculations & then the totals from those individual calculations were summed into a single result.
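
To make that concrete, here’s a minimal Python sketch of the card exercise (the hands dealt below are made up for illustration):

```python
from collections import Counter

# Made-up hands for our three participants.
hands = [
    ["bird", "pig", "bird", "horse"],
    ["pig", "pig", "horse", "bird"],
    ["horse", "bird", "pig", "bird"],
]

# Map: each participant counts their own cards independently.
partial_counts = [Counter(hand) for hand in hands]

# Reduce: sum the three individual counts into one total per animal.
totals = sum(partial_counts, Counter())
print(totals)  # Counter({'bird': 5, 'pig': 4, 'horse': 3})
```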

So, where does Hadoop fit in?

Yahoo developed Hadoop in 2005, based on Google’s MapReduce framework. Since then, hundreds of open source tools have been released to work with Hadoop; we’ll cover a few below.

First off, let’s consider how Hadoop hits all the marks when it comes to big data. It:

  • Enables scalability for large data volumes
  • Supports a fault-tolerant infrastructure with graceful recovery
  • Is optimized for a variety of data types
  • Enables multiple jobs to execute simultaneously
  • Provides value to the business – it’s open source & community supported
  • Supports MapReduce

Below are some of the open source applications you may have heard of. The core components of Hadoop are coloured in blue:

HDFS:

The Hadoop Distributed File System takes a file, splits it into 64MB chunks (by default), replicates each chunk three times and stores the replicas across multiple nodes in different racks, which gives us data resilience.
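
As a quick back-of-the-envelope sketch (assuming the 64MB default block size & a replication factor of three), here’s how a single file fans out into blocks & replicas:

```python
import math

BLOCK_SIZE_MB = 64       # HDFS default block size (in older Hadoop releases)
REPLICATION_FACTOR = 3   # each block is stored on three separate nodes

def hdfs_footprint(file_size_mb):
    # How many blocks the file is split into, and how many stored copies
    # end up spread across the cluster.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION_FACTOR

print(hdfs_footprint(1_000))  # a ~1GB file -> (16, 48): 16 blocks, 48 stored copies
```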

HDFS comprises two major components: a name node & data nodes. We usually have only one name node, which holds the metadata about the data we store (block locations etc…), while the data nodes hold the actual blocks of data.

Hortonworks has noted that HDFS is suitable for up to 200PB of data across 4,500 nodes – showing the true scalability of the solution.

YARN:

Yet Another Resource Negotiator is the resource management layer of the Hadoop cluster. It schedules jobs across all nodes and comprises: the resource manager, node managers, application masters and containers (bundles of CPU & memory).

  • Resource manager: decides which nodes will do the work
  • Node manager: handles machine-level management
  • Application master: negotiates resources from the resource manager and co-ordinates node managers to get the job done

Ultimately, YARN gives us a way to run a number of applications on Hadoop that weren’t possible in Hadoop 1.0, leading to better cluster utilization and lower costs.
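
Purely as an illustration of that division of labour (this is not YARN’s actual API; the names & scheduling logic are made up), a toy resource manager might hand out containers like this:

```python
# Toy model of the roles above; illustrative only, not the real YARN API.
free_slots = {"node1": 4, "node2": 4, "node3": 2}  # free containers per node manager

def resource_manager_grant(requested):
    """Resource manager: decide which nodes will run the requested containers."""
    granted = []
    for node in free_slots:
        while free_slots[node] > 0 and len(granted) < requested:
            granted.append(node)          # one container granted on this node
            free_slots[node] -= 1
    return granted

# The application master negotiates six containers for its job.
print(resource_manager_grant(6))  # ['node1', 'node1', 'node1', 'node1', 'node2', 'node2']
```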

MapReduce:

A parallel computing framework that enables us to distribute computation across multiple data nodes. So, let’s look at an example. We have a file that contains the text below.

Our task is to find out how many times each word is repeated in the entire document.

Now, for simplicity’s sake, let’s say that our file is split into two pieces and stored on two data nodes by HDFS.

In the first step, the data nodes are going to individually count the appearances of each word and create key-value pairs, as shown in the blue column below.

Then, we move to the orange column, where all occurrences of the same word are sent to the same node. So, if nodes 1 and 2 both include the word ‘data’, those pairs are moved to the same node in the shuffle phase.

We then have the green column, which adds up all the occurrences & combines them into a single list. You can see the highlighted values ‘map’, ‘reduce’, ‘data’ and ‘the’, which have been counted multiple times across the two lines of text.

So we have three stages:

  • Map, where we have parallelization over the inputs (i.e. multiple processes happen across multiple machines in parallel).
  • Shuffle / sort, which has parallelization over the sorting of the data.
  • Reduce, which has parallelization over the grouping of the data.
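
The file’s text isn’t reproduced here, so the two splits below are made-up stand-ins, but this minimal pure-Python sketch walks through the same three stages (in a real cluster, the map & reduce steps would run on separate data nodes and the shuffle would happen over the network):

```python
from itertools import groupby

# Stand-in text for the two splits stored on our two data nodes.
splits = [
    "map reduce splits the data",
    "the data feeds the map and reduce steps",
]

# Map (blue column): each node emits a (word, 1) pair for every word in its split.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle / sort (orange column): bring identical words together,
# as if they had all been sent to the same node.
mapped.sort(key=lambda pair: pair[0])

# Reduce (green column): sum the counts for each word into a single list.
totals = {word: sum(count for _, count in pairs)
          for word, pairs in groupby(mapped, key=lambda pair: pair[0])}
print(totals)  # 'the': 3, 'data': 2, 'map': 2 and 'reduce': 2 appear more than once
```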

So, we can see that MapReduce is incredibly useful in some situations. It is, however, not so useful for any kind of task that requires interactive analysis (that is, where results start coming through immediately), as MapReduce must complete the entire job before any results are shown.

Other components:

  • Hive: an SQL-like interface to query data
  • Pig: enables complex data transformation scripts
  • Giraph: processing large scale graphs. Facebook uses this for social graphs of users
  • Storm: distributed real-time stream processing
  • Spark: in-memory processing. Up to 100x faster than MapReduce for some jobs
  • Flink: stream & batch processing
  • HBase: a column-oriented DBMS that runs on top of Hadoop
  • Cassandra: a distributed NoSQL database
  • MongoDB: a document-oriented NoSQL database
  • ZooKeeper: co-ordination service for distributed applications
  • Oozie: workflow / co-ordination of jobs
  • Sqoop: data exchange with other systems (like SQL databases)
  • Flume: log collector
  • Ambari: provisioning, managing and monitoring clusters