While the term ‘big data’ is used a lot by small and large organizations alike, it doesn’t always mean that they’ve got a firm grasp on the concept of the technology and its benefits. As such, the ideal starting point of this post is to discuss the concept in a little more detail, ensuring that we have common understanding of the subject matter before we delve any further into the detail.
To quote SAS (source), “Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate analyses.”
To understand why data grows at an exponential rate, we must look at a case study, for the purpose of this post, we will discuss Ebay.com. Ebay may have many more competitors than they did at launch, but they are still a major player in the ecommerce market, with 157 million active user accounts each quarter (source).
Each activity that a user carries out on their website is logged in their database. That means, each time you look at an item, buy an item or watch an item, Ebay takes that information and stores it in their database. If we assume that each of those 157 million user accounts carry out at least 10 actions per quarter then you can start to see how data can grow at such an uncontrollable rate – in this example, 1,570,000,000 new actions being stored in their database each quarter.
Ebay is an extreme example of how data can continue to grow and can snowball out of control which leads to the question ‘do I have big data?’
Unfortunately, there are no set parameters which will enable you to calculate if your data set should be considered to be ‘big data’, however, there are some factors we can apply to conclude whether you do indeed have a big data project on your hands.
The first thing to note is that the definition of big data will be influenced by the capabilities of the organization that owns the dataset. For example, for a small company, with only Microsoft Excel at it’s disposal, 10GB of data could be an overwhelming amount to analyse and could take a long time to process. Conversely, global organizations, such as Amazon, will process thousands of gigabytes of data each hour – so to them, 10GB would not be considered a big data project at all.
Along the same lines, drawing information from a local database and visualizing that data to gain insight may be considered as a big data project by a small organization. However, in larger companies, data will be drawn from multiple sources, which may include: local databases; Microsoft Excel files; web services (such as Google Analytics) and plenty more. Once again, this comes down to your company’s capacity to implement big data solutions that draw upon multiple data sources.
At Netshock, we consider that if your data grows by 100% year on year and has a variety of data sources, then you are well within your rights to consider your project to be a big data project.
So, what is Hadoop then?
Hadoop is a very scalable big data solution. It is designed to work on the cheapest of infrastructure and can be extended out to run thousands of nodes, processing thousands of terabytes of data. To put that into cost terms, there have been case studies where companies are running for hundreds (rather than tens of thousands) of pounds per terabyte of data – which is incredible.
By adopting a distributed file system, we can drastically improve performance as multiple servers are able to analyze data in parallel.
Hadoop also improves data redundancy, as data is replicated across multiple nodes, meaning if one node fails the data is still available to the end user.
It should be noted that Hadoop is ‘write once, read many’. What this means is, once the data has been written to Hadoop, it cannot be modified. This is useful for data that can be archived (such as bank transactions, which will never need to be updated).
Why is Hadoop needed?
Current RDBMS (Relational Database Management Systems) have certain limits. Those limits generally centre around scalability – that is the ability to rapidly increase the amount of data being ingested and stored in the database. While it is possible to scale a traditional RDBMS to be Terabytes and possibly Petabytes of data, it becomes very expensive to provision new hardware and even spending hundreds of thousands of pounds on the system will not alleviate you of the ability to ingest large amounts of data.
There is a concept called CAP (Consistency, Availability and Partitioning) theory used when discussing databases, I’ve explained each of the CAP elements below:
Consistency: In many cases (such as line of business transactions), a business needs to ensure extremely high consistency levels. For example, let’s say that you want to transfer money from your savings account to your current account. That process will be carried out as two transactions. In this case, you want both transactions to be completed successfully, failing that, neither should be successful – otherwise you’d have lost your money! Hadoop is not good at storing this data. It is not designed for transactional data. Rather, it is designed for batch processing.
Availability: This is the availability of the data stored in the database and the ability to replicate data for resilience.
Partitioning: Also referred to as scalability, is the ability to split data across multiple machines, taking advantage of multiple processors for scalability.
Traditionally, RDBMS databases are very good at Consistency and Availability, but fall short when it comes to partitioning. Hadoop is very good at Partitioning and Availability, but falls short when it comes to consistency. As such, Hadoop should be considered as an addition, rather than a replacement to your traditional RDBMS.
The reason that Hadoop is so strong in the partitioning and availability sections of the CAP theory is simple – it’s designed to run on very cheap and potentially old hardware, meaning that you can rapidly expand your cluster on a budget. As such, companies that collect endless amounts of data (like Facebook and Yahoo) utilize Hadoop as their database system of choice.
Hadoop is very good at behavioural data management. As an example, an online banking system will store user’s successful transactions, however it may also be useful to see how long the user spent on each page, where they moved their mouse and how they interacted with the system to assess usability and ultimately make improvements to the system.
In terms of Availability, Hadoop automatically creates 3 copies of the data it stores, across different machines. If one machine fails, the data is still available through the other 2 machines. Once that machine is replaced, it is automatically re-populated with data.
What are the components of Hadoop?
HDFS is the default file system used for Hadoop. It’s main feature is that the data held within it is triple replicated by default – if you were deploying in AWS, you would use S3 as your HDFS..
Map Reduce – discussed below
Hbase – database suited for data where each row may be slightly different
HIVE – SQL-like query language
Pig – Scripting ETL (Extract Transform Load) processing
Mahout – Machine learning / predictive analytics
Oozie – Workflow / Co-Ordination of jobs
ZooKeeper – Co-ordination of groups of jobs
Sqoop – data exchange with other systems (like SQL)
Flume – Log collector
Ambari – Provisioning, managing, monitoring clusters
What is Map Reduce?
IBM defines MapReduce as the ‘heart’ of Hadoop. It’s a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.
MapReduce can be broken down into two distinct tasks. Firstly, the map task filters and sorts data to produce a consolidated data set. Looking at the below sample data, we may script the map task to produce an output file which would show the max temperature for each city that appears in the file.
|1 Jan||New York||20|
|2 Jan||New York||15|
|3 Jan||New York||21|
|4 Jan||New York||13|
The map task would do this for each node in the Hadoop cluster in parallel, so if you had 500 nodes in your Hadoop cluster, you’d output 500 individual files, all including the max temperature for each city from their respective node.
Those 500 output files would then feed into the reduce task. In this step, each of the output files would be aggregated into a single file, whereby a single max figure would be given for each city.
The benefit of the map reduce task is that it takes data from disparate systems (multiple Hadoop nodes), analyses each, in parallel and then reduces the subsequent query results to a single figure.
Components Involved In MapReduce:
|Job Tracker||MapReduce jobs are submitted to the job tracker. It then schedules client jobs and manages the overall execution of the map reduce job.
The job tracker keeps track of the consumed and available resources on the system.
The job tracker is a point of failure for the Hadoop MapReduce service. If it fails, all running jobs are halted.
|Task Tracker||The task tracker executes and manages the individual tasks allocated by the job tracker.
The task tracker sends a heartbeat to the job tracker every few seconds.
Task trackers have a set of slots. The number of slots on a task tracker indicate how many jobs it can accept. When the job tracker looks for a task tracker, it looks for one on the same server as the data, failing that, in the same rack as the data.
The task tracker communicates the status of the job to the job tracker.