Big Data Crash Course: HBase


HBase is a NoSQL database that runs on top of HDFS. It’s suited to real-time read and write access to large datasets that have a flexible schema. What does that mean?

Well, let’s go back and look at our semi-structured data model from chapter one. You can see here that each set of values is different. We have two values for user 182 and three values for user 183.

Some systems allow user-defined parameters, so the data is likely to get messy and won’t follow a predictable structure. As such, we need a database that is flexible enough to adapt to the requirements of our specific application.
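To make the flexible-schema idea concrete, here is a minimal sketch in plain Python. It is not HBase code: the table is just a dictionary standing in for a sparse table, and the row keys, column names and values are all made up for illustration.

```python
# Hypothetical in-memory stand-in for a sparse, flexible-schema table.
# Each row key maps to its own set of columns; no row is forced to
# carry columns it does not use. All names and values are invented.
table = {
    "user-182": {"name": "Michael", "age": "30"},
    "user-183": {"name": "Sarah", "age": "41", "favourite_colour": "green"},
}


def columns_for(row_key):
    """Return the sorted column names stored for one row."""
    return sorted(table[row_key].keys())


# A user-defined parameter can be added to one row without touching the others.
table["user-182"]["newsletter"] = "yes"

for key in table:
    print(key, columns_for(key))
```

Notice that nothing had to be declared up front: each row simply stores the columns it has.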

HBase is a key-value store, which means you’ll achieve very fast reads if you’re looking up a record by its key. For example, in the example below, we could look up a customer based on their telephone number (the key) and access all the underlying details about that customer (the value).

Key: 078276387126

Value: “Name: Michael; Type: Employee; age: 30; salary: 4000”

This rapid access through key-value lookups may not fit your model or way of working. But where it does, it’s a very powerful concept.
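The trade-off can be sketched with a toy key-value store in plain Python. This is an illustration only, not the HBase API; the function names get_by_key and find_by_name are hypothetical. Looking up by key goes straight to the record, while looking up by anything else means checking every record.

```python
# Toy key-value store: the telephone number is the key,
# the customer record is the value.
customers = {
    "078276387126": {"Name": "Michael", "Type": "Employee", "age": 30, "salary": 4000},
}


def get_by_key(phone):
    """Direct lookup by key: no other records need to be examined."""
    return customers.get(phone)


def find_by_name(name):
    """Lookup by a non-key field: every record has to be checked."""
    return [phone for phone, rec in customers.items() if rec["Name"] == name]


print(get_by_key("078276387126"))
print(find_by_name("Michael"))
```

If most of your queries look like find_by_name rather than get_by_key, that is a sign the key-value model may not suit your workload, or that a different key is needed.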

Resilience is key in big data initiatives, and HBase provides some solid functionality in this area. Within a cluster, it fails over automatically when a server is lost. It can also replicate data to a second cluster in a master / slave setup, and because that replication can run across a LAN or a WAN, it adds geographic resilience to the solution.
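The master / slave idea can be sketched in a few lines of Python. This is a deliberately simplified toy, not how HBase implements replication: the Node and ReplicatedStore classes are hypothetical, replication is done synchronously, and failure is simulated by flipping a flag.

```python
class Node:
    """One copy of the data, which can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Toy master/slave pair: writes go to the master and are copied to the slave."""

    def __init__(self):
        self.master = Node("master")
        self.slave = Node("slave")

    def _failover_if_needed(self):
        # If the master is down, promote the slave so that service continues.
        if not self.master.alive:
            self.master, self.slave = self.slave, self.master

    def put(self, key, value):
        self._failover_if_needed()
        self.master.data[key] = value
        self.slave.data[key] = value  # replication, synchronous here for simplicity

    def get(self, key):
        self._failover_if_needed()
        return self.master.data.get(key)


store = ReplicatedStore()
store.put("078276387126", "Michael")
store.master.alive = False          # simulate losing the master
print(store.get("078276387126"))    # still served, now by the promoted slave
```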

Couple this resilience with the ability to scale horizontally without any database redesign and the ability to do some in-memory caching and we’re starting to build a good picture of HBase benefits.

HBase offers immediate (strong) consistency for reads and writes to a row. That means data written to the database is available to readers as soon as the write completes, because each row is served by a single region server at any one time. This is not the case with eventually consistent stores, which accept a write on one replica and copy it to the others afterwards. If the data is written to node one and a user queries node two before it has been copied there, they will not see the latest data.
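The difference between immediate and eventual consistency can be illustrated with a toy model in Python. Both classes below are hypothetical and exist only for this sketch; real systems are far more involved. The eventually consistent store copies a write to its second replica only when replicate() is called, so a read from that replica in the meantime misses the write.

```python
class EventuallyConsistentStore:
    """Two replicas; a write reaches the second one only when replicate() runs."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []

    def put(self, key, value):
        self.replicas[0][key] = value
        self.pending.append((key, value))  # the copy to replica 1 happens later

    def replicate(self):
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()

    def get(self, key, replica):
        return self.replicas[replica].get(key)


class ImmediatelyConsistentStore:
    """Each key has one authoritative copy, so reads always see the last write."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


eventual = EventuallyConsistentStore()
eventual.put("salary", 4000)
stale = eventual.get("salary", replica=1)   # read before replication has run
eventual.replicate()
fresh = eventual.get("salary", replica=1)   # read after replication has run

immediate = ImmediatelyConsistentStore()
immediate.put("salary", 4000)
print(stale, fresh, immediate.get("salary"))
```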

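HBase offers automatic partitioning / sharding. Sharding is the process of breaking large datasets into smaller, more manageable chunks (called data shards). In HBase these chunks are called regions, and each one holds a contiguous range of row keys. Because different servers can work on different shards at the same time, this enables us to adopt a parallelized processing model.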
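Automatic sharding can be sketched as a table that splits itself once a shard grows too large. The ShardedTable class below is a hypothetical toy in plain Python, assuming shards are defined by ranges of sorted keys and that a shard splits at its middle key; the threshold of two rows is chosen only to keep the example small.

```python
import bisect


class ShardedTable:
    """Toy range-partitioned table: each shard holds a contiguous range of keys."""

    def __init__(self, max_rows_per_shard=2):
        self.max_rows = max_rows_per_shard
        # Shard i covers keys >= start_keys[i] and < start_keys[i + 1].
        self.start_keys = [""]
        self.shards = [{}]

    def _shard_index(self, key):
        # Find the last shard whose start key is <= the given key.
        return bisect.bisect_right(self.start_keys, key) - 1

    def put(self, key, value):
        i = self._shard_index(key)
        self.shards[i][key] = value
        if len(self.shards[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Split the oversized shard in two at its middle key.
        keys = sorted(self.shards[i])
        mid = keys[len(keys) // 2]
        lower = {k: v for k, v in self.shards[i].items() if k < mid}
        upper = {k: v for k, v in self.shards[i].items() if k >= mid}
        self.shards[i] = lower
        self.shards.insert(i + 1, upper)
        self.start_keys.insert(i + 1, mid)

    def get(self, key):
        return self.shards[self._shard_index(key)].get(key)


table = ShardedTable()
for row_key in ["a", "b", "c", "d"]:
    table.put(row_key, row_key.upper())
print(table.start_keys)
print(table.shards)
```

The caller only ever uses put and get; the splitting happens behind the scenes, which is the sense in which the partitioning is automatic.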

The final point to make here is that HBase can leverage MapReduce, which enables us to use it for both real-time data reads and large batch jobs.
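As a rough idea of what a batch job over such a table looks like, here is a minimal MapReduce-style sketch in plain Python. It does not use Hadoop or the HBase API; the rows, the function names and the job itself (counting customers by type) are made up for illustration.

```python
from collections import defaultdict

# Made-up rows, keyed by telephone number.
rows = {
    "078276387126": {"Name": "Michael", "Type": "Employee"},
    "078276387127": {"Name": "Priya", "Type": "Contractor"},
    "078276387128": {"Name": "Tom", "Type": "Employee"},
}


def map_row(row_key, columns):
    """Map step: emit (type, 1) for each row."""
    yield columns["Type"], 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(table):
    grouped = defaultdict(list)
    for row_key, columns in table.items():
        for k, v in map_row(row_key, columns):
            grouped[k].append(v)  # shuffle: group emitted values by key
    return dict(reduce_counts(k, vs) for k, vs in grouped.items())


print(run_job(rows))
```

In a real cluster the map step would run in parallel across the shards of the table, with the reduce step combining the partial results.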