Big Data Crash Course: Hive


Apache Hive provides us with a familiar SQL-like query language (HiveQL) to access and analyse data stored in HDFS. Hive translates our queries into MapReduce jobs, so even without programming experience we can run jobs across many servers and benefit from data parallelisation.
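As a sketch of what this looks like in practice, a HiveQL aggregate such as the following (the table and column names are hypothetical) is compiled into map and reduce phases behind the scenes:

```sql
-- Hypothetical table: web_logs(user_id STRING, bytes BIGINT, dt STRING)
-- The GROUP BY is compiled into a map phase (emit user_id, bytes)
-- and a reduce phase (sum per user_id), run in parallel across the cluster.
SELECT user_id,
       SUM(bytes) AS total_bytes
FROM   web_logs
WHERE  dt = '2017-01-01'
GROUP  BY user_id;
```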

Within Hive, we replicate a traditional relational database structure: we have databases, tables and tuples (rows), and we enforce a schema on Hive tables, defining column names, data types and so on. The table data itself remains in HDFS; the schema definitions are stored in the Hive metastore.
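To illustrate, a table definition might look like this minimal sketch (the database, table, columns and delimiter are all hypothetical):

```sql
-- Create a database and a table whose schema is applied on read
-- over delimited files sitting in HDFS.
CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE IF NOT EXISTS sales.transactions (
    transaction_id  BIGINT,
    customer_id     STRING,
    amount          DECIMAL(10,2),
    created_at      TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```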

The Hive metastore is a relational database that stores metadata about objects in Hive, including databases, tables, column names and data types. HCatalog, built on top of the metastore, exposes this schema repository to remote clients and other tools such as Spark and Pig.
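For example, we can inspect the metadata held in the metastore directly from HiveQL (the table name below is hypothetical):

```sql
-- List the tables the metastore knows about in the current database.
SHOW TABLES;

-- Show column names, data types and storage details for one table.
DESCRIBE FORMATTED sales.transactions;
```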

The Hive Thrift server (HiveServer2) provides JDBC and ODBC connectivity to Hive for remote clients such as Tableau. This adds to the overall power of the Hive solution: not only can we use Hive directly on the cluster, but also through self-service analytics tools, making the data more accessible across the business.
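As a sketch, a remote client would connect with a JDBC URL like the following (the host and database are placeholders; 10000 is HiveServer2's default port) and then issue ordinary HiveQL:

```sql
-- Hypothetical HiveServer2 JDBC URL used by a remote client:
--   jdbc:hive2://hive-host.example.com:10000/default
-- Once connected, the client runs normal HiveQL over the wire:
SELECT COUNT(*) FROM web_logs;
```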

We can utilize Hue (as below) to provide a web-based user interface through which we can access Hive. Again, this adds a level of usability that gives business users greater autonomy to form insights from the data.

As you can see from the further images below, we can use very familiar SQL to query our datasets in Hive.

Hortonworks recently developed Hive LLAP (Live Long and Process), which combats one of Hive's major historic drawbacks: query latency. LLAP uses long-lived daemons and in-memory caching to reduce disk IO and ultimately make queries run much faster. It is well suited to interactive queries on smaller datasets; traditional Hive on Tez remains the better fit for large batch jobs.
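As a minimal sketch, assuming a cluster where LLAP daemons are already running, a session can be pointed at LLAP with standard Hive properties (availability of these settings depends on your distribution):

```sql
-- Run on Tez, with LLAP daemons handling execution where possible.
SET hive.execution.engine=tez;
SET hive.llap.execution.mode=all;

-- Subsequent queries in this session can now be served from
-- LLAP's in-memory cache, reducing disk IO.
```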