Big Data Crash Course: HDFS


HDFS stands for Hadoop Distributed File System. A distributed file system manages files and folders across multiple servers for the benefit of resiliency and rapid data processing by utilizing parallelisation (as discussed in chapter one).

So what does that mean? Let’s take a look.

From the above, you can see three ‘blocks’ of data. Each is replicated across multiple racks. This creates fault tolerance as if one rack fails, the data would still be available. This also helps us to scale access to data by many concurrent users.

This setup does bring about some problems. For example, it becomes hard to manage changes to files & ensure that all locations are updated. For this reason, many platforms adopt a ‘write once’ mentality & save any changes as additional data sets.

In the past, we used to use parallel computing clusters, comprised of specialist hardware, costing millions of dollars to create & maintain. Now though, we can utilize commodity clusters. That is, a cluster with affordable, less specialized compute resources. This reduces costs significantly and makes big data a possibility for more organizations than ever before.

In the above, we have three networks. The blue network, red network and green network. These comprise our WAN (Wide Area Network). We are able to utilize distributed computing over a LAN or WAN environment, further adding to the resilience & fault tolerance of the solution.

Let’s summarize the benefits of using HDFS:

First off, HDFS is an extremely low cost solution. It can be deployed onto commodity, low cost hardware and can utilize the free open source Apache license. Of course, this comes without the niceties of support etc.. but depending on your use case, that may work for you.

HDFS allows us to enact the notion that we take compute power to the data. So in traditional data systems, data would be extracted from storage and processed on a central node. However, with HDFS, the data is process in-situ on its local node. All results are then aggregated together at the end to provide a single output. This is the core of the mapreduce concept. We’ll discuss this in more detail in the next chapter of this book.

HDFS also provides massive throughput to Map Reduce jobs. As quoted by Hortonworks, each node can send 2 gigabits per second to the mapreduce layer. So, if you had a 100 node cluster, that’d be 20 gigabits per second you could push through to mapreduce. Massive throughput is required for massive analytics!

HDFS also provides us with resilience by replicating data three times across different data nodes. HDFS is also rack aware which means it can distribute that data across different racks and even different geographies to protect from data loss when a location has issues.

Find out how to interact with HDFS here