While the term ‘big data’ is used a lot by small and large organizations alike, it doesn’t always mean we’re speaking the same language or that we share the same understanding of the technology and its benefits. As such, the ideal starting point for this course is to discuss the concept in a little more detail, ensuring that we have a common understanding of the subject matter before we delve further into the detail.
Big data can be defined through the use of the 6 V’s: Volume, Variety, Velocity, Veracity, Valence and Value.
Structured data has the same attributes (fields) on each line, much like an Excel document: it has a repeatable and predictable pattern of data throughout the file.
Let’s think about structured data in relation to our lives at work. Do you have a database storing all your customers’ details and each of the purchases that they make? This is structured data. Every customer record will have the same attributes (first name, surname, address, etc.).
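To make this concrete, here is a minimal sketch in Python using the standard `csv` module. The field names (`first_name`, `surname`, `city`) are hypothetical example attributes, but the key point holds: every record exposes the same, predictable set of fields.

```python
import csv
import io

# A tiny structured dataset: every record has the same fields.
# (first_name, surname, city are hypothetical example attributes.)
raw = io.StringIO(
    "first_name,surname,city\n"
    "Ada,Lovelace,London\n"
    "Alan,Turing,Wilmslow\n"
)

rows = list(csv.DictReader(raw))

# Each row exposes an identical, predictable set of attributes.
for row in rows:
    assert set(row) == {"first_name", "surname", "city"}

print(rows[0]["surname"])  # Lovelace
```

Because the pattern repeats on every line, tools can rely on the schema up front, which is exactly what makes structured data easy to query.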
We have two types of parallelisation: data and task. Data parallelisation is where we run the same function on different partitions of a dataset across multiple nodes (servers), while task parallelisation is where we run many different functions across many nodes, potentially against different datasets, at the same time.
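The distinction can be sketched on a single machine with Python’s `concurrent.futures` (a toy stand-in for a cluster of nodes; real data parallelism would distribute the partitions across servers):

```python
from concurrent.futures import ThreadPoolExecutor

def total(chunk):
    """The SAME function, applied to one partition of the data."""
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# Data parallelism: one function runs over different partitions of a
# single dataset, and the partial results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    data_parallel_result = sum(pool.map(total, chunks))

# Task parallelism: DIFFERENT functions run concurrently, potentially
# over different datasets.
with ThreadPoolExecutor(max_workers=2) as pool:
    f_min = pool.submit(min, data)
    f_max = pool.submit(max, data)
    task_parallel_result = (f_min.result(), f_max.result())

print(data_parallel_result, task_parallel_result)  # 4950 (0, 99)
```

In a real big data platform the “workers” are separate servers rather than threads, but the split-apply-combine shape is the same.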
Ingestion refers to the process of bringing external data into the big data cluster for further analysis. There are two types of ingestion: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT).
Extract, Transform, Load refers to the process of transforming or manipulating the data in some way as it enters the cluster. For example, we might choose to drop certain fields, aggregate or enrich data. This can also be referred to as ‘schema-on-write’.
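A minimal schema-on-write sketch in plain Python (the records, field names and “warehouse” dict are hypothetical; a real pipeline would write to HDFS or a database):

```python
# Hypothetical raw purchase events arriving from a source system.
raw_events = [
    {"customer": "c1", "amount": 10.0, "card_number": "4111-1111"},
    {"customer": "c1", "amount": 5.0,  "card_number": "4111-1111"},
    {"customer": "c2", "amount": 7.5,  "card_number": "5500-2222"},
]

def transform(event):
    """Drop sensitive fields BEFORE the data is written (schema-on-write)."""
    return {k: v for k, v in event.items() if k != "card_number"}

def load(events):
    """Aggregate spend per customer into our 'warehouse' (a dict here)."""
    totals = {}
    for e in events:
        totals[e["customer"]] = totals.get(e["customer"], 0) + e["amount"]
    return totals

warehouse = load(transform(e) for e in raw_events)
print(warehouse)  # {'c1': 15.0, 'c2': 7.5}
```

In ELT, by contrast, the raw events would be landed first and the `transform` step would run later, inside the cluster.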
Let’s get started with Sqoop – it’s the simplest component of the Hadoop ecosystem and can be described very quickly. Sqoop is a tool used for large-scale, bulk data transfer between a structured data source (such as a relational database) and HDFS or HBase.
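As a flavour of how simple it is, a typical Sqoop import looks something like the command below (the connection details, database and table names are hypothetical):

```shell
# Pull the 'customers' table from a MySQL database into HDFS,
# splitting the transfer across 4 parallel map tasks.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table customers \
  --target-dir /data/customers \
  --num-mappers 4
```

Under the hood Sqoop turns this into a MapReduce job, which is where the bulk-transfer parallelism comes from.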
Kafka is a low-latency messaging system that can have many producers and many consumers.
To expand on this, we can take a look at the below diagram. You can see that we have three data sources pushing data into Kafka. This data is then channeled through Kafka’s highly available and fault tolerant structure and waits to be pulled by the consumers of the feed.
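To make the many-producers/many-consumers model concrete, here is a toy in-memory sketch of Kafka’s central idea – producers append to a topic’s log, and each consumer tracks its own read offset, so many consumers can read the same feed independently. This is an illustration only, not the Kafka API:

```python
class MiniLog:
    """Toy, in-memory sketch of a Kafka topic: an append-only log."""
    def __init__(self):
        self.messages = []
        self.offsets = {}  # each consumer tracks its own read position

    def produce(self, message):
        self.messages.append(message)

    def consume(self, consumer_id):
        """Return unread messages for this consumer and advance its offset."""
        start = self.offsets.get(consumer_id, 0)
        self.offsets[consumer_id] = len(self.messages)
        return self.messages[start:]

topic = MiniLog()

# Three producers push into the same topic...
for source in ("web_logs", "sensors", "crm"):
    topic.produce(f"event from {source}")

# ...and independent consumers each pull the feed at their own pace.
print(topic.consume("analytics"))  # all three events
topic.produce("late event")
print(topic.consume("analytics"))  # only the new event
print(topic.consume("archiver"))   # all four events
```

Real Kafka adds partitioning, replication and durability on top of this shape, which is what gives it the high availability and fault tolerance described above.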
Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The differentiation here is that Flume is designed for high-throughput log streaming into Hadoop – it is not designed to ship out to a large number of consumers.
NiFi (formerly known as Niagara Files) has brought a new level of flexibility to the Hadoop ecosystem. Like Kafka and Flume, it’s used to move data from one system to another but, unlike those platforms, it does so in a very user-friendly manner.
Apache Storm provides us with a distributed, real-time computation platform. It has been designed to reliably process large streams of data at high velocity. To put this into something we can all appreciate, Storm was benchmarked processing one million tuples per second per node.
Apache Spark is a framework available in Hadoop. It comprises SparkSQL, Spark Streaming and the Spark Machine Learning Library (MLlib). We will go through the entire framework in this section as, without doing so, we would not be able to fully appreciate the benefits offered by Spark.
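Spark’s core programming model is a chain of transformations over a distributed dataset, followed by an action that produces a result. As a rough single-machine analogy (plain Python, not the PySpark API – in Spark the data would be partitioned across the cluster and the transformations evaluated lazily):

```python
from functools import reduce

# A stand-in for a distributed dataset (in Spark this would be an
# RDD or DataFrame partitioned across the cluster).
numbers = range(1, 11)

# Transformations, chained much like rdd.filter(...).map(...)
evens   = filter(lambda n: n % 2 == 0, numbers)
squared = map(lambda n: n * n, evens)

# An 'action' triggers computation and returns a result to the driver,
# much like rdd.reduce(...)
result = reduce(lambda a, b: a + b, squared)
print(result)  # 4 + 16 + 36 + 64 + 100 = 220
```

Each Spark component (SparkSQL, Streaming, MLlib) builds on this same transformation/action model over distributed data.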
This video covers AWS Kinesis at a high level. It’ll help you to understand the capabilities of AWS Kinesis, along with the different ingestion options that we have.
We have discussed earlier in this series the benefits of Flafka (Flume and Kafka) and NiFi coupled with Kafka. Here, we will review those configurations and also look at other ways we may choose to ingest data into our platform by combining multiple components.
HDFS stands for Hadoop Distributed File System. A distributed file system manages files and folders across multiple servers for the benefit of resiliency and rapid data processing by utilizing parallelisation (as discussed earlier).
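Two ideas do most of the work in HDFS: files are split into fixed-size blocks, and each block is replicated across several nodes. A toy sketch of that placement logic (the tiny block size and round-robin placement are simplifications; HDFS defaults to 128 MB blocks and rack-aware placement):

```python
BLOCK_SIZE = 4    # toy value; HDFS defaults to 128 MB blocks
REPLICATION = 3   # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split a 'file' into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (toy round-robin)."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks("hello big data world!")
print(len(blocks))              # 6 blocks of up to 4 characters
print(place_blocks(blocks)[0])  # ['node1', 'node2', 'node3']
```

Replication gives the resiliency (losing one node loses no data), and having blocks spread across nodes is what enables the parallel processing discussed earlier.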
HBase is a NoSQL database that runs on top of HDFS. It’s suited to real-time read and write access to large datasets that have a flexible schema. What does that mean?
MongoDB is a document-oriented database and one of the leaders of the NoSQL revolution. To summarize what a NoSQL database is, we can say that it’s a structure that enables us to consume data of varying structure (structured, semi-structured and unstructured), schema and data type. NoSQL databases enable us to respond quickly to user demand through scalability and flexibility.
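“Flexible schema” is easiest to see with the JSON-like documents MongoDB stores. In the sketch below (plain Python dicts standing in for a hypothetical `customers` collection – this is not the pymongo API), two documents live in the same collection without sharing the same fields:

```python
import json

# Two documents in the same hypothetical 'customers' collection --
# they do not need to share the same fields (flexible schema).
collection = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Alan", "phone": "0123", "tags": ["vip", "early-adopter"]},
]

# Queries simply handle fields that may or may not be present.
vips = [doc["name"] for doc in collection if "vip" in doc.get("tags", [])]
print(vips)                       # ['Alan']
print(json.dumps(collection[0]))  # documents serialize naturally as JSON
```

Contrast this with the structured customer table earlier, where every record had to carry the same attributes.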
DynamoDB is a non-relational (NoSQL) database platform delivered by AWS. This video takes you through an overview of the service and its capabilities.
This section won’t focus too heavily on MySQL itself – it’s a very well documented, traditional relational database system. The focus of this article will be why it’s part of a big data ecosystem.
To answer that question, we can use the below diagram, which shows a few different methods we can use to take data from its source and land it into storage. The purple line of the diagram shows the path that data destined for MySQL might take. The grey paths show the route that other data may take through the system.
This video provides an overview of Amazon S3. It’ll enable you to talk with confidence with your colleagues about S3 and the various options available through the service.
Amazon RDS (Relational Database Service) is a fully managed relational database service in the cloud (e.g. MySQL, PostgreSQL, Aurora). Fully managed means that AWS manages all the underlying hardware for you and handles the updating of software (such as MySQL server versions).
Below are the core components of a security policy in Hadoop:
Apache Ranger enables us to monitor and manage data security across our Hadoop cluster. It enables us to define a security policy for users or groups of the cluster once and apply it across all supported components in the Hadoop stack.
Kerberos authenticates users. It can be complex to configure – to make everything a lot simpler, we can carry out Kerberos setup, configuration and maintenance through Ambari.
Kerberos is a network authentication protocol. It is designed to provide strong authentication for client/server applications by using secret-key cryptography. It has the following characteristics: