Before we get into the technicalities of Hadoop, it’s worth discussing the limits of current databases and the subsequent need for a solution like Hadoop.
Current RDBMSs (Relational Database Management Systems) have certain limits. Those limits generally centre around scalability – that is, the ability to rapidly increase the amount of data being ingested and stored in the database. While it is possible to scale a traditional RDBMS to terabytes, and possibly petabytes, of data, provisioning new hardware becomes very expensive, and even spending hundreds of thousands of pounds on the system will not guarantee that it can ingest data at the rate you need.
There is a concept called the CAP theorem (Consistency, Availability and Partition tolerance) used when discussing databases. I’ve explained each of the CAP elements below:
Consistency: In many cases (such as line-of-business transactions), a business needs to ensure extremely high consistency levels. For example, let’s say that you want to transfer money from your savings account to your current account. That process will be carried out as two transactions. In this case, you want both transactions to complete successfully or, failing that, neither should – otherwise you’d have lost your money! Hadoop is not good at storing this kind of data. It is not designed for transactional workloads; rather, it is designed for batch processing.
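The all-or-nothing behaviour described above is exactly what an RDBMS transaction provides. Here is a minimal sketch using Python’s built-in sqlite3 module; the table and account names are illustrative, not from any real banking schema:

```python
import sqlite3

# Two updates that must succeed or fail together: the transfer
# described above, expressed as a single transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("savings", 100), ("current", 0)])

with conn:  # opens a transaction; commits on success, rolls back on any error
    conn.execute("UPDATE accounts SET balance = balance - 50 "
                 "WHERE name = 'savings'")
    conn.execute("UPDATE accounts SET balance = balance + 50 "
                 "WHERE name = 'current'")

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both updates applied together, or neither at all
```

If either UPDATE raised an error, the `with conn:` block would roll back the other one as well – the guarantee Hadoop deliberately does not provide.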
Availability: This is the ability to serve the data stored in the database at all times, typically achieved by replicating data for resilience.
Partition tolerance: Closely tied to scalability, this is the ability to split data across multiple machines – and keep working when those machines cannot all communicate – taking advantage of multiple processors to scale out.
Traditionally, RDBMS databases are very good at consistency and availability, but fall short when it comes to partition tolerance. Hadoop is very good at partition tolerance and availability, but falls short when it comes to consistency. As such, Hadoop should be considered as an addition to, rather than a replacement for, your traditional RDBMS.
The reason that Hadoop is so strong in the partition tolerance and availability aspects of the CAP theorem is simple – it’s designed to run on very cheap, potentially old hardware, meaning that you can rapidly expand your cluster on a budget. As such, companies that collect endless amounts of data (like Facebook and Yahoo) have adopted Hadoop as their data platform of choice.
Hadoop is very good at behavioural data management. As an example, an online banking system will store users’ successful transactions; however, it may also be useful to see how long each user spent on each page, where they moved their mouse and how they interacted with the system, in order to assess usability and ultimately make improvements to the system.
In terms of availability, Hadoop automatically creates three copies of each block of data it stores, across different machines. If one machine fails, the data is still available from the other two copies, and the lost blocks are automatically re-replicated to restore the full replication factor.
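The replication idea can be illustrated with a toy sketch in plain Python – this is not actual HDFS code, and the block and node names are made up – showing why one machine failing still leaves the data readable:

```python
# Toy model of three-way replication: each block is stored on three
# machines, so losing any single machine leaves two live copies.
replicas = {"block-1": {"node-a", "node-b", "node-c"}}

def read_block(block, failed_nodes):
    """Return any surviving machine that can serve the block."""
    live = replicas[block] - failed_nodes
    if not live:
        raise IOError("all replicas lost")
    return next(iter(live))  # any surviving replica will do

# node-b has failed, but the read still succeeds from node-a or node-c
print(read_block("block-1", failed_nodes={"node-b"}))
```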
There are a number of components that are often implemented with Hadoop. These are outlined below:
HDFS (Hadoop Distributed File System) – distributed storage
MapReduce – processing API
Hive – query language
Pig – data-flow scripting language
HBase – NoSQL database
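The MapReduce model from the list above can be sketched in a few lines of plain Python. This is an in-process simulation for illustration, not a real Hadoop job: map emits key/value pairs, the framework groups them by key (the “shuffle”), and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Aggregate all counts emitted for one word
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]

# Shuffle: group intermediate values by key, as the framework would
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result["the"])  # → 2
```

In a real cluster, the map and reduce phases run in parallel on many machines and the shuffle moves data between them; the logic per phase, however, is exactly this simple.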
One to note from the above is HBase, which is a NoSQL database solution, helping to make sense of data extracted from the large Hadoop data store. It uses a wide-column structure, which essentially means each row can have whatever columns it needs – there is no requirement for a fixed schema.
Record 1 – Name: Bob | Car: Toyota | Colour: Red
Record 2 – Location: Slough | House Number: 124 | Car: Fiesta
As can be seen above, each record in the database does not necessarily have all the same data points.
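The schema-less records above map naturally onto Python dictionaries. This toy sketch (not HBase’s actual API) shows how a query simply skips records that lack a given column:

```python
# Each record is just a set of column/value pairs; records need not
# share the same columns, mirroring the wide-column example above.
records = [
    {"Name": "Bob", "Car": "Toyota", "Colour": "Red"},
    {"Location": "Slough", "House Number": 124, "Car": "Fiesta"},
]

# A query over the "Car" column skips any record without it
cars = [r["Car"] for r in records if "Car" in r]
print(cars)  # → ['Toyota', 'Fiesta']

# The second record has a "Location" column the first lacks – no schema
# change is required for this, unlike in a traditional RDBMS.
locations = [r["Location"] for r in records if "Location" in r]
```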
There are three main categories of Hadoop Distributions. Firstly, there is the original, open source distribution provided by Apache. This particular distribution is developed rapidly and has an aggressive release schedule. It’s best practice to always remain a couple of versions behind the latest release to ensure that all bugs have been worked out before upgrading.
There are also commercial distributions. In the same way that Ubuntu builds its own layers on top of the Linux kernel, Cloudera, Hortonworks and MapR build their own features and tools on top of the open source Hadoop distribution. These tools may include monitoring libraries.
Finally, we have cloud distributions, which can be deployed on Amazon Web Services, for example. It is possible to deploy both commercial and open source Hadoop distributions to the cloud; however, some commercial distributions may not be supported by all cloud providers.
Hadoop is suited to:
Credit fraud detection
Behavioural analysis for UX