Before we look at the overall architecture of the Hadoop ecosystem, let’s first think about the core concepts of any big data system: ingestion, storage, analysis and management.
Ingestion refers to the process of bringing external data into the big data cluster for further analysis. There are two types of ingestion: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT).
Extract, Transform, Load refers to the process of transforming or manipulating the data in some way as it enters the cluster. For example, we might choose to drop certain fields, aggregate or enrich data. This can also be referred to as ‘schema-on-write’.
Extract, Load, Transform refers to the process of moving data from its current source to our big data cluster in its raw format – with no transformation or manipulation taking place on the data. This can also be referred to as ‘schema-on-read’.
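The distinction above can be made concrete with a small sketch. This is a minimal, illustrative example in plain Python, not a real ingestion pipeline: the record fields, the `transform` function and the pipeline helpers are all hypothetical, chosen only to show where the transformation happens in each pattern.

```python
# Hypothetical raw records arriving from a source system.
RAW_RECORDS = [
    {"user": "alice", "amount": "12.50", "internal_id": "x1"},
    {"user": "bob", "amount": "7.25", "internal_id": "x2"},
]

def transform(record):
    """Drop an unneeded field and convert the amount to a number."""
    return {"user": record["user"], "amount": float(record["amount"])}

def etl(source):
    """ETL: transform each record *before* it lands in storage,
    so the cluster only ever sees the curated shape (schema-on-write)."""
    return [transform(r) for r in source]

def elt(source):
    """ELT: land the raw records untouched; any transformation is
    applied later, at query time (schema-on-read)."""
    stored = list(source)  # stored exactly as received

    def read():
        # The schema is imposed only when the data is read.
        return [transform(r) for r in stored]

    return stored, read

curated = etl(RAW_RECORDS)
raw, read = elt(RAW_RECORDS)
print(curated[0])  # transformed on the way in
print(raw[0])      # raw copy, internal_id still present
```

Note that both patterns end up producing the same curated view; the trade-off is *when* the work is done and whether the raw data is retained for future, as-yet-unknown analyses.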
Storage refers to the location of the data within the cluster. We have various storage options and it’s not always obvious which ones to use. We will discuss storage options in further detail later in this book.
Analysis is the process of deriving value from our stored data. This can include joining multiple data sources together, generating reports, predictive analytics, machine learning and lots more. This also includes the visualisation of the data to make it useful for the end recipient.
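As a small illustration of the "joining multiple data sources" step, the sketch below enriches a sales feed with customer data and produces a simple report. It is plain Python rather than a real cluster query, and all of the field names and figures are invented for the example; in practice this kind of join would typically be expressed in SQL or a distributed processing framework.

```python
# Two hypothetical data sources: a sales feed and a customer table.
sales = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 40.0},
    {"customer_id": 1, "amount": 60.0},
]
customers = {1: {"region": "EMEA"}, 2: {"region": "APAC"}}

# Join: enrich each sale with the customer's region (a hash join in miniature).
enriched = [
    {**sale, "region": customers[sale["customer_id"]]["region"]}
    for sale in sales
]

# Report: total revenue per region.
totals = {}
for row in enriched:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
print(totals)  # {'EMEA': 160.0, 'APAC': 40.0}
```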
We need a management wrapper around everything we do with the big data cluster. The management layer enables us to monitor and control all of our ingestion, storage and analytics to ensure that we extract the maximum value from our big data initiatives.
And of course, we also have security which is paramount for all big data systems. As architects, we need to ensure that the data is locked down to only those that require access; that the cluster is hidden behind a firewall and that all our traffic is encrypted. We discuss all of these concepts in more detail later in this book.
Below, you’ll see an overview of the Hadoop ecosystem. This diagram should help you to conceptualise the links between each of the elements and understand in more detail how the Hadoop cluster works. At this stage, it is not necessary to understand what any of the components are or what they do; we will cover this in later chapters. For now, take some time to review the diagram and appreciate the complexities of the big data world.