Learn the ingestion, storage, analysis, security and management tools used in the big data environment, specifically Hadoop
While the term ‘big data’ is used a lot by small and large organizations alike, it doesn’t always mean that we’re talking the same language or that we share the same understanding of the technology and its benefits. As such, the ideal starting point of this course is to discuss the concept in a little more detail, ensuring that we have a common understanding of the subject matter before we delve any further into the detail.
Big data can be defined through the use of the 6 V’s: Volume, Variety, Velocity, Veracity, Valence and Value.
Structured data has the same attributes (fields) on each line, much like an Excel spreadsheet. It has a repeatable and predictable pattern of data throughout the file.
Let’s think about structured data in relation to our lives at work. Do you have a database storing all of your customers’ details and each of the purchases that they make? This is structured data. Every customer record will have the same attributes (first name, surname, address, etc.).
We have two types of parallelisation: data and task. Data parallelisation is where we run the same function on different partitions of a dataset across multiple nodes (servers), while task parallelisation is where we run many different functions across many nodes, potentially against different datasets, at the same time.
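We can sketch the two styles with Python’s standard library. The dataset and the functions here are invented purely for illustration, and the thread pool stands in for a cluster of nodes:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 9))

def double(chunk):
    """The same function, applied to one partition of the data."""
    return [x * 2 for x in chunk]

# Data parallelisation: split the dataset into partitions and run the
# same function on each partition concurrently (one per "node").
chunks = [data[0:4], data[4:8]]
with ThreadPoolExecutor(max_workers=2) as pool:
    doubled = [x for part in pool.map(double, chunks) for x in part]

# Task parallelisation: run different functions, potentially against
# different datasets, at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(sum, data)        # task 1
    biggest = pool.submit(max, doubled)   # task 2
    results = (total.result(), biggest.result())

print(doubled)   # [2, 4, 6, 8, 10, 12, 14, 16]
print(results)   # (36, 16)
```

On a real cluster the partitions would live on different servers, but the shape of the two approaches is the same.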
Ingestion refers to the process of bringing external data into the big data cluster for further analysis. There are two types of ingestion: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT).
Extract, Transform, Load refers to the process of transforming or manipulating the data in some way as it enters the cluster. For example, we might choose to drop certain fields, aggregate or enrich data. This can also be referred to as ‘schema-on-write’.
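A toy ETL pipeline makes this concrete. The records and field names below are invented for the example — the point is that the data is reshaped (fields dropped, values enriched and cast) before it lands in storage:

```python
# Invented source records, as they might arrive from an external system.
raw_records = [
    {"first_name": "Ada", "surname": "Lovelace", "spend": "120.50", "internal_ref": "x9"},
    {"first_name": "Alan", "surname": "Turing", "spend": "80.00", "internal_ref": "k2"},
]

def transform(record):
    """Drop fields we don't need and enrich with a derived field."""
    return {
        "full_name": f"{record['first_name']} {record['surname']}",  # enrichment
        "spend": float(record["spend"]),                             # cast on write
        # 'internal_ref' is deliberately dropped
    }

warehouse = []           # stand-in for the cluster's storage layer
for rec in raw_records:  # extract -> transform -> load
    warehouse.append(transform(rec))

print(warehouse[0])  # {'full_name': 'Ada Lovelace', 'spend': 120.5}
```

In an ELT (schema-on-read) pipeline, by contrast, the raw records would be loaded as-is and the `transform` step would run later, at query time.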
Let’s get started with Sqoop – it’s the simplest component of the Hadoop ecosystem and can be described very quickly. Sqoop is a tool used for large-scale, bulk data transfer between a structured data source (such as a relational database) and HDFS or HBase.
Kafka is a low-latency messaging system that can have many producers and many consumers.
To expand on this, we can take a look at the below diagram. You can see that we have three data sources pushing data into Kafka. This data is then channeled through Kafka’s highly available, fault-tolerant structure and waits to be pulled by the consumers of the feed.
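The key idea — an append-only feed that many consumers pull from independently — can be modelled in a few lines. This is a toy, in-memory sketch of Kafka’s log-and-offset design, not the real Kafka API (with a live broker you would use a client library such as kafka-python):

```python
class ToyBroker:
    """A toy model of one Kafka topic: an append-only log plus one
    read position (offset) per consumer."""

    def __init__(self):
        self.log = []        # the append-only topic log
        self.offsets = {}    # each consumer's read position

    def produce(self, message):
        self.log.append(message)

    def consume(self, consumer):
        """Each consumer pulls from its own offset, so many consumers
        can read the same feed independently."""
        pos = self.offsets.get(consumer, 0)
        batch = self.log[pos:]
        self.offsets[consumer] = len(self.log)
        return batch

broker = ToyBroker()
for source in ("web", "mobile", "crm"):        # three producers
    broker.produce(f"event from {source}")

print(broker.consume("analytics"))   # all three events
print(broker.consume("billing"))     # the same three events, independently
print(broker.consume("analytics"))   # [] -- nothing new since the last pull
```

Because the log is retained rather than handed to a single recipient, adding a new consumer never disturbs the existing ones.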
Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The differentiation here is that Flume is designed for high-throughput log streaming into Hadoop – it is not designed to ship out to a large number of consumers.
NiFi (formerly known as Niagara Files) has brought a new level of flexibility into the Hadoop ecosystem. Like Kafka and Flume, it’s used to move data from one system to another but unlike those platforms, it does so in a very user-friendly manner.
Apache Storm provides us with a distributed, real-time computation platform. It has been designed to reliably process large streams of data at high velocity. To put this into something we can all appreciate, Storm was benchmarked processing one million tuples per second per node.
Apache Spark is a framework available in Hadoop. That framework comprises Spark SQL, Spark Streaming and the Spark Machine Learning library (MLlib). We will go through the entire framework in this section as without doing so, we will not be able to fully appreciate the benefits offered by the Spark framework.
This video covers AWS Kinesis at a high level. It’ll help you to understand the capabilities of AWS Kinesis, along with the different ingestion options that we have.
We have discussed earlier in this series the benefits of Flafka (Flume and Kafka) and NiFi coupled with Kafka. Here, we will review those configurations and also look at other ways we may choose to ingest data into our platform by combining multiple components.
HDFS stands for Hadoop Distributed File System. A distributed file system manages files and folders across multiple servers for the benefit of resiliency and rapid data processing by utilizing parallelisation (as discussed earlier).
HBase is a NoSQL database that runs on top of HDFS. It’s suited to real-time read and write access to large datasets that have a flexible schema. What does that mean?
MongoDB is a document-oriented database and is one of the leaders of the NoSQL revolution. To summarize what a NoSQL database is, we can say that it’s a database that enables us to consume data of varying structure (structured, semi-structured and unstructured), varying schema and varying data types. NoSQL databases enable us to respond quickly to user demand through scalability and flexibility.
DynamoDB is a non-relational (NoSQL) database platform delivered by AWS. This video takes you through an overview of the service & its capabilities.
This section won’t focus too heavily on MySQL. It’s a very well documented, traditional relational database system. The focus of this article will be why it’s part of a big data ecosystem.
To answer that question, we can use the below diagram, which shows a few different methods we can use to take data from its source and land it into storage. The purple line of the diagram shows the path that data destined for MySQL might take. The grey paths show the route that other data may take through the system.
This video provides an overview of Amazon S3. It’ll enable you to talk with confidence with your colleagues about S3 & the various options available through the service.
Amazon RDS (Relational Database Service) is a fully managed relational database service in the cloud (e.g. MySQL, PostgreSQL, Aurora). Fully managed means that AWS manages all the underlying hardware for you and handles the updating of software (like MySQL server versions).
Amazon EMR is the Amazon Hadoop Distribution. This video provides an overview of the service, along with the MapR distribution available through AWS.
MapReduce is a parallel computing framework that enables us to distribute computing across multiple data nodes. Let’s look at an example. We have a file that has the below text included in it.
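The classic MapReduce illustration is a word count. Below is a minimal, single-machine sketch of the map, shuffle and reduce phases in Python — the input text is invented, since the file contents aren’t reproduced here, and on a real cluster each phase would run across many nodes:

```python
from itertools import groupby
from operator import itemgetter

text = "the quick brown fox jumps over the lazy dog the end"

# Map phase: emit a (key, value) pair for every word.
mapped = [(word, 1) for word in text.split()]

# Shuffle phase: group the pairs by key (the framework does this for us).
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the values for each key.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts["the"])  # 3
```

Because each mapper only needs its own slice of the file and each reducer only needs one key’s pairs, both phases parallelise naturally.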
Apache Hive provides us with a familiar SQL-like query language to access and analyse the data stored in HDFS. Hive translates our SQL-like queries into MapReduce jobs – so even without programming experience, we can run jobs across many servers and benefit from data parallelisation.
Python has become a leading technology in the data analysis space. It’s one of the most sought-after skills & something that can help you advance your career in big data. The following series of articles will cover the language and libraries in detail, enabling you to get hands-on with Python to tackle your own use cases.
Below is a quick overview of key Python functions. ‘….’ is used to show indentation.
Python is great for file analysis – we’ll be looking into more complex functions using the Pandas libraries in subsequent articles, but for now, let’s look at the out of the box functionality that we can use in Python.
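As a taste of that out-of-the-box functionality, here’s a small sketch using only the standard library. The file name and its contents are made up for the example:

```python
from collections import Counter

# Write a small sample file so the example is self-contained.
sample = "first line\nsecond line\nsecond line again\n"
with open("sample.txt", "w") as f:
    f.write(sample)

# Read it back and analyse it with built-ins only.
with open("sample.txt") as f:
    lines = f.read().splitlines()

line_count = len(lines)
word_count = sum(len(line.split()) for line in lines)
most_common = Counter(" ".join(lines).split()).most_common(1)

print(line_count)   # 3
print(word_count)   # 7
print(most_common)  # [('line', 3)]
```

No third-party libraries are needed for this kind of quick file inspection — Pandas becomes worthwhile once the data is tabular and larger.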
NumPy stands for Numerical Python. It’s a library that contains a collection of routines for processing arrays.
An array is a list. A multi-dimensional array is essentially a grid or table (an array that contains two or more arrays).
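A quick sketch of both shapes in NumPy (the values are invented for the example):

```python
import numpy as np

# A one-dimensional array -- a list of values.
arr = np.array([1, 2, 3, 4])

# A two-dimensional array -- a grid (an array containing two arrays).
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(arr.sum())    # 10
print(grid.shape)   # (2, 3) -- two rows, three columns
print(grid * 2)     # NumPy routines operate element-wise across the array
```

The element-wise routines are what make NumPy fast: the loop runs in compiled code rather than in Python.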
A Pandas Series is a one-dimensional array (a list). An example one-dimensional array is below:
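One possible example (the values and index labels are invented):

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
sales = pd.Series([250, 300, 175], index=["Jan", "Feb", "Mar"])

print(sales["Feb"])   # 300 -- values can be looked up by label
print(sales.mean())   # the average across the Series
```

Unlike a plain Python list, a Series carries an index, so values can be selected by label as well as by position.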
In this article, we’ll look at some of the key components of PySpark, which is one of the most in-demand big data technologies at the current time.
When completing my domain normalisation project, I used Spark to do the heavy lifting – getting data into a dataframe & aggregating (group by and sum) and then used Pandas for the domain manipulation. Finally, I converted my Pandas dataframe back to Spark, to write it to HDFS.
Sentiment analysis can provide key insight into the feelings of your customers towards your company & hence is becoming an increasingly important part of data analysis.
The below script shows how we may handle RFM segmentation with Python. RFM stands for Recency, Frequency and Monetary:
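Since the full script isn’t reproduced here, the following is a minimal sketch of the idea in Pandas. The transactions, customer names and column names are all invented:

```python
import pandas as pd

# Invented transaction data: customer, days since purchase, order value.
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b", "c"],
    "days_ago": [5, 40, 2, 10, 30, 90],
    "value":    [50.0, 20.0, 200.0, 150.0, 60.0, 10.0],
})

# One row per customer: recency, frequency and monetary value.
rfm = tx.groupby("customer").agg(
    recency=("days_ago", "min"),      # days since the most recent purchase
    frequency=("days_ago", "count"),  # number of purchases
    monetary=("value", "sum"),        # total spend
)

print(rfm)
```

In a real segmentation you would then rank or score each of the three columns (e.g. into quintiles) to bucket customers, but the aggregation above is the core of the technique.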
KMeans clustering searches for clusters of data within a dataset. This is an unsupervised learning model. If we look at plot 1 below, we can easily see the clusters of data – but we haven’t labeled the data (we haven’t told KMeans which cluster each datapoint belongs to). However, as you can see at the bottom of the page, the clusters have been correctly defined.
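To show the mechanics, here is a minimal, pure-Python k-means sketch — in practice you would use a library implementation such as scikit-learn’s `KMeans`. The points form two obvious clusters, and the initial centroids are fixed so the run is deterministic:

```python
# Invented 2-D points forming two well-separated clusters.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 2.0),    # cluster near (1, 1.7)
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]    # cluster near (8.5, 8.3)

def mean(pts):
    """Centroid of a set of points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def closest(p, centroids):
    """Index of the centroid nearest to p (squared distance)."""
    return min(range(len(centroids)),
               key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                             (p[1] - centroids[i][1]) ** 2)

centroids = [points[0], points[3]]   # naive, fixed initialisation
for _ in range(10):                  # assign each point, then re-centre
    labels = [closest(p, centroids) for p in points]
    centroids = [mean([p for p, l in zip(points, labels) if l == k])
                 for k in range(2)]

print(labels)  # [0, 0, 0, 1, 1, 1] -- the two clusters are recovered
```

Note that no labels were supplied: the algorithm discovers the grouping purely from the distances between points, which is what makes it unsupervised.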
Let’s use the above example as our JSON script on which we’ll base our queries. Before we get started, let’s cover off the core components of a MongoDB query.
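The two core components are the filter (which documents to match) and the projection (which fields to return). The sketch below shows their shape as plain Python dicts against a hypothetical collection — with a live database they would be passed to a client, e.g. PyMongo’s `collection.find(query_filter, projection)`:

```python
# Filter: match documents where spend is greater than 100 ($gt operator).
query_filter = {"spend": {"$gt": 100}}

# Projection: return only the name field (and suppress the default _id).
projection = {"name": 1, "_id": 0}

# A tiny stand-in for the server-side match, just to show the semantics.
documents = [{"name": "Ada", "spend": 150},
             {"name": "Alan", "spend": 50}]
matched = [{"name": d["name"]}
           for d in documents
           if d["spend"] > query_filter["spend"]["$gt"]]

print(matched)  # [{'name': 'Ada'}]
```

The documents and field names here are invented; the point is the filter/projection structure, which stays the same whatever the collection holds.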
Machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs, we’re able to achieve far greater accuracy in our predictions & constantly refine the model with new data.
Supervised learning is where we provide the model with the actual outputs from the data. This lets it build a picture of the data and form links between the historic parameters (or features) that have influenced the output. To put a formula onto supervised learning, it would be as below, where Y is the predicted output produced by the model and X is the input data. So, by executing a function against X, we can predict Y.
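Informally, the formula is Y = f(X), where f is learned from historic (X, Y) pairs. The toy below makes that concrete by fitting a straight line with least squares — the numbers are invented, and a real model would of course have many features:

```python
# Historic inputs and their actual outputs (the supervision signal).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Least-squares estimates of slope and intercept (f is a line here).
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def f(x):
    """The learned model: executing f against X predicts Y."""
    return slope * x + intercept

print(f(5.0))  # 11.0 -- the predicted Y for an unseen X
```

The training data above follows y = 2x + 1 exactly, so the fitted f recovers that relationship and extrapolates it to new inputs.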
The below is a logistic regression model, which uses some dummy data to determine whether people are at risk of diabetes or not – of course, this model couldn’t actually determine whether or not someone does have diabetes, it’s just a demonstration.
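As the original script isn’t reproduced here, the following is a stand-in sketch of the same idea: a tiny logistic regression trained by gradient descent on an invented single-feature “risk score” dataset (in practice you would use a library such as scikit-learn):

```python
import math

# Invented data: (risk score, label) where 1 means "at risk".
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def sigmoid(z):
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0                        # model parameters
lr = 0.5                               # learning rate
for _ in range(2000):                  # gradient descent on the log loss
    grad_w = grad_b = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y   # prediction error for this example
        grad_w += err * x
        grad_b += err
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)

def predict(x):
    """Classify as 'at risk' if the predicted probability exceeds 0.5."""
    return sigmoid(w * x + b) > 0.5

print(predict(1.5), predict(7.5))  # False True
```

The model learns a decision boundary between the two groups; the output is a probability, which is what distinguishes logistic regression from a plain linear model.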
A decision tree builds a model in the form of a tree structure – almost like a flow chart. In order to calculate the expected outcome, it uses decision points and based on the results of those decisions, it’ll bucket each input. In this article, we’ll talk about classification and regression decision trees, along with random forests.
We discussed decision trees and random forests in quite a lot of detail here. This article will take you through a practical implementation, where based on historic data, we aim to predict future weather. The data for this model is continuous & hence requires a regression model, rather than a discrete classification model.
Below are the core components of a security policy in Hadoop:
Apache Ranger enables us to monitor and manage data security across our Hadoop cluster. It enables us to define a security policy for users or groups of the cluster once and apply it across all supported components in the Hadoop stack.
Kerberos authenticates users. It can be complex to configure – to make everything a lot simpler, we can carry out simplified Kerberos setup, configuration and maintenance through Ambari.
Kerberos is a network authentication protocol. It is designed to provide strong authentication for client/server applications by using secret-key cryptography. It has the following characteristics: