Getting started with Big Data

This learning path will guide you through the ETL, storage, analysis, management and security aspects of Big Data.

INTRODUCTION

While the term ‘big data’ is used a lot by small and large organizations alike, it doesn’t always mean that we’re speaking the same language or that we share the same understanding of the technology and its benefits. As such, the ideal starting point of this course is to discuss the concept in a little more detail, ensuring that we have a common understanding of the subject matter before we delve further into the detail.

Big data can be defined through the six V’s: Volume, Variety, Velocity, Veracity, Valence and Value.

Structured data has the same attributes (fields) on each line, much like an Excel spreadsheet. It has a repeatable and predictable pattern of data throughout the file.

Let’s think about structured data in relation to our lives at work. Do you have a database storing all your customers’ details and each of the purchases that they make? This is structured data. Every customer record will have the same attributes (first name, surname, address, etc.).

We have two types of parallelisation: data and task. Data parallelisation is where we run the same function on different partitions of a dataset across multiple nodes (servers), while task parallelisation is where we run many different functions across many nodes, potentially against different datasets, at the same time.
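
As a loose illustration (plain Python multiprocessing, not a Hadoop API), the sketch below applies one function to every record of a dataset (data parallelism) and then runs two different functions at the same time (task parallelism). The function and variable names are made up for the example.

```python
from multiprocessing import Pool

def clean(record):
    # Data parallelism: this same function is applied to every record.
    return record.strip().lower()

def count_words(records):
    return sum(len(r.split()) for r in records)

def count_chars(records):
    return sum(len(r) for r in records)

records = ["  Alpha one ", "BETA two three ", " gamma "]

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        # Data parallelism: the same function runs over different chunks of the dataset.
        cleaned = pool.map(clean, records)

        # Task parallelism: two different functions run concurrently on the same data.
        words = pool.apply_async(count_words, (cleaned,))
        chars = pool.apply_async(count_chars, (cleaned,))
        print(cleaned, words.get(), chars.get())
```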

Ingestion refers to the process of bringing external data into the big data cluster for further analysis. There are two types of ingestion: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT).

Extract, Transform, Load refers to the process of transforming or manipulating the data in some way as it enters the cluster. For example, we might choose to drop certain fields, aggregate or enrich data. This can also be referred to as ‘schema-on-write’.
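
As a minimal sketch of the difference (plain Python, with hypothetical field names), ETL shapes each record before it is written, whereas ELT lands the raw record and leaves the transformation for later (schema-on-read):

```python
raw_record = {"first_name": "Ada", "surname": "Lovelace",
              "card_number": "1234-5678-9012-3456", "spend": "42.50"}

def etl_transform(record):
    # Transform on the way in (schema-on-write): drop a sensitive field,
    # cast types and enrich with a derived value.
    return {
        "full_name": f"{record['first_name']} {record['surname']}",
        "spend": float(record["spend"]),
        "high_value": float(record["spend"]) > 40,
    }

# ETL: the cluster stores the already-shaped record.
stored_etl = etl_transform(raw_record)

# ELT: the cluster stores the record exactly as it arrived;
# transformation happens later, at analysis time.
stored_elt = dict(raw_record)

print(stored_etl)
print(stored_elt)
```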

INGESTION

Let’s get started with Sqoop, one of the simplest components of the Hadoop ecosystem to describe. Sqoop is a tool for large-scale, bulk data transfer between structured data stores (such as relational databases) and HDFS or HBase.

Kafka is a low-latency messaging system that can have many producers and many consumers.

To expand on this, we can take a look at the diagram below. You can see that we have three data sources pushing data into Kafka. This data is then channeled through Kafka’s highly available and fault-tolerant structure and waits to be pulled by the consumers of the feed.
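
As a rough sketch of that producer/consumer model, the example below uses the third-party kafka-python client; the broker address and the topic name "web-logs" are assumptions made for the illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# A producer (one of potentially many data sources) pushes messages into a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("web-logs", b'{"page": "/home", "status": 200}')
producer.flush()

# A consumer (one of potentially many downstream systems) pulls messages
# from the same topic at its own pace.
consumer = KafkaConsumer(
    "web-logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```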

Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The difference is that Flume is designed for high-throughput log streaming into Hadoop; it is not designed to fan data out to a large number of consumers.

NiFi (formerly known as Niagara Files) has brought a new level of flexibility to the Hadoop ecosystem. Like Kafka and Flume, it’s used to move data from one system to another, but unlike those platforms it does so in a very user-friendly manner.

Apache Storm provides us with a distributed, real-time computation platform. It has been designed to reliably process large streams of data at high velocity. To put this into perspective, Storm has been benchmarked at processing one million tuples per second per node.

Apache Spark is a data processing framework that runs on Hadoop. It comprises SparkSQL, Spark Streaming and the Spark Machine Learning Library (MLlib). We will go through the entire framework in this section, as without doing so we won’t be able to fully appreciate the benefits it offers.
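
As a small PySpark sketch (the file path and column names are hypothetical), SparkSQL lets us register a distributed DataFrame as a view and query it with familiar SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Read a (hypothetical) structured dataset into a distributed DataFrame.
sales = spark.read.option("header", True).csv("hdfs:///data/sales.csv")

# SparkSQL: register the DataFrame as a temporary view and query it with SQL.
sales.createOrReplaceTempView("sales")
top_customers = spark.sql("""
    SELECT customer_id, SUM(CAST(amount AS DOUBLE)) AS total_spend
    FROM sales
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()

spark.stop()
```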

This video covers AWS Kinesis at a high level. It’ll help you to understand the capabilities of AWS Kinesis, along with the different ingestion options that we have.
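
As a minimal boto3 sketch (the stream name "clickstream" and the region are assumptions, and AWS credentials are presumed to be configured), a producer pushes a record into a Kinesis stream like this:

```python
import json
import boto3

# Assumes a stream named "clickstream" already exists in the chosen region (hypothetical).
kinesis = boto3.client("kinesis", region_name="eu-west-1")

event = {"user_id": "u-123", "page": "/checkout", "status": 200}

response = kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # controls which shard receives the record
)
print(response["SequenceNumber"])
```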

We have discussed earlier in this series the benefits of Flafka (Flume and Kafka) and NiFi coupled with Kafka. Here, we will review those configurations and also look at other ways we may choose to ingest data into our platform by combining multiple components.

STORAGE

HDFS stands for Hadoop Distributed File System. A distributed file system manages files and folders across multiple servers, giving us resiliency and rapid data processing through parallelisation (as discussed earlier).
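
As a purely conceptual sketch (plain Python, not the HDFS API), a distributed file system splits a file into fixed-size blocks and replicates each block across several nodes, which is what gives us both resiliency and parallel processing. The node names are made up for the illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS commonly uses 128 MB blocks
REPLICATION = 3                  # a typical default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def plan_blocks(file_size_bytes):
    """Illustrative only: show how a file might be split and replicated."""
    block_count = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
    plan = {}
    for i in range(block_count):
        # Spread the replicas of each block across different nodes.
        plan[f"block-{i}"] = [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
    return plan

# A 400 MB file becomes 4 blocks, each stored on 3 different nodes.
for block, replicas in plan_blocks(400 * 1024 * 1024).items():
    print(block, "->", replicas)
```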

HBase is a NoSQL database that runs on top of HDFS. It’s suited to real-time read and write access to large datasets that have a flexible schema. What does that mean?
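
As a short sketch using the third-party happybase client (the hostname, table name and column family are hypothetical), HBase reads and writes are simple key-based puts and gets against column families, and the columns can vary from row to row:

```python
import happybase

# Connects to the HBase Thrift server (hostname is an assumption).
connection = happybase.Connection("hbase-master.example.com")
table = connection.table("customers")

# Write: columns live inside a column family ("info" here) and can differ per row,
# which is what gives HBase its flexible schema.
table.put(b"cust-001", {b"info:first_name": b"Ada", b"info:city": b"London"})
table.put(b"cust-002", {b"info:first_name": b"Alan", b"info:loyalty_tier": b"gold"})

# Read: real-time access to a single row by key.
print(table.row(b"cust-001"))

connection.close()
```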

MongoDB is a document-oriented database and one of the leaders of the NoSQL revolution. To summarize, a NoSQL database is a structure that enables us to consume data of varying structure (structured, semi-structured and unstructured), varying schema and varying data types. NoSQL databases enable us to respond quickly to user demand through scalability and flexibility.
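
A brief pymongo sketch (the database and collection names are hypothetical) shows that flexible schema in practice: two documents in the same collection can carry completely different fields.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Two documents in the same collection with different shapes; no schema change needed.
db.customers.insert_one({"first_name": "Ada", "surname": "Lovelace", "city": "London"})
db.customers.insert_one({"first_name": "Alan", "orders": [{"item": "book", "qty": 2}]})

# Query on whichever fields a document happens to have.
for doc in db.customers.find({"first_name": "Ada"}):
    print(doc)

client.close()
```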

DynamoDB is a non-relational (NoSQL) database platform delivered by AWS. This video takes you through an overview of the service and its capabilities.
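
A minimal boto3 sketch (the table name "Customers" and its key schema are assumptions, with AWS credentials presumed to be configured) shows the basic put/get pattern:

```python
import boto3

# Assumes a table named "Customers" already exists with "customer_id"
# as its partition key (both hypothetical).
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("Customers")

table.put_item(Item={"customer_id": "cust-001",
                     "first_name": "Ada",
                     "loyalty_tier": "gold"})

response = table.get_item(Key={"customer_id": "cust-001"})
print(response.get("Item"))
```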

This section won’t focus too heavily on MySQL itself; it’s a very well documented, traditional relational database system. The focus of this article is why it forms part of a big data ecosystem.

To answer that question, we can use the diagram below, which shows a few different methods we can use to take data from its source and land it in storage. The purple line shows the path that data destined for MySQL might take, while the grey paths show the routes other data may take through the system.

This video provides an overview of Amazon S3. It’ll enable you to talk confidently with your colleagues about S3 and the various options available through the service.
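
A short boto3 sketch (the bucket and key names are hypothetical, and AWS credentials are presumed to be configured) covers the two operations most ingestion pipelines need: uploading an object and listing what is already there.

```python
import boto3

# Assumes the bucket "my-data-lake" already exists (hypothetical name).
s3 = boto3.client("s3")

# Upload a local file as an object under a date-partitioned prefix.
s3.upload_file("sales.csv", "my-data-lake", "raw/sales/2024-01-01/sales.csv")

# List the objects under that prefix.
listing = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/sales/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])
```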

Amazon RDS (Relational Database Service) is a fully managed relational database service in the cloud (e.g. MySQL, PostgreSQL, Aurora). Fully managed means that AWS manages all the underlying hardware for you and handles the updating of software (such as MySQL server versions).
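
Because RDS exposes a standard database endpoint, connecting to it looks just like connecting to any other MySQL server. This sketch uses the third-party PyMySQL driver; the endpoint, credentials and table name are placeholders.

```python
import pymysql

# The endpoint, credentials and database name below are placeholders;
# RDS provides the endpoint, while the MySQL protocol itself is unchanged.
connection = pymysql.connect(
    host="mydb.abc123xyz.eu-west-1.rds.amazonaws.com",
    user="admin",
    password="example-password",
    database="shop",
)
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM customers")
        print(cursor.fetchone())
finally:
    connection.close()
```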

Amazon EMR is Amazon’s Hadoop distribution. This video provides an overview of the service, along with the MapR distribution available through AWS.

SECURITY & MANAGEMENT

The core components of a security policy in Hadoop are authentication (proving who a user is), authorization (controlling what they can access), auditing (recording what they did) and data protection (encrypting data at rest and in transit).

Apache Ranger enables us to monitor and manage data security across our Hadoop cluster. It enables us to define a security policy for users or groups of the cluster once and apply it across all supported components in the Hadoop stack.

Kerberos authenticates users. It can be complex to configure, but we can simplify Kerberos setup, configuration and maintenance by managing it through Ambari.

Kerberos is a network authentication protocol, designed to provide strong authentication for client/server applications by using secret-key cryptography. It relies on a trusted third party (the Key Distribution Centre) to issue time-limited tickets, it never sends passwords across the network in the clear, and it provides mutual authentication so that both the client and the server prove their identity.

YARN (Yet Another Resource Negotiator) is the central part of our cluster management. It decouples resource management and job scheduling from data processing; both used to be handled by MapReduce itself.
