Category: Big Data Course

Big Data Crash Course: MySQL

This section won’t focus too heavily on MySQL itself; it’s a very well-documented, traditional relational database system. The focus of this article will instead be why it’s part of a big data ecosystem. To answer that question, we can use the below diagram, which shows a few different methods...

Big Data Crash Course: MongoDB

MongoDB is a document-oriented database and one of the leaders of the NoSQL revolution. To summarise what a NoSQL database is, we can say that it’s a database that enables us to consume data of varying structure (structured, semi-structured and unstructured), of varying schema and data...
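To illustrate the flexible, document-oriented model, here is a minimal sketch using plain Python dicts rather than a running MongoDB instance — the sample documents and field names are made up for illustration. With the `pymongo` driver, the same documents could be inserted as-is via `collection.insert_many(docs)`.

```python
# Sketch of MongoDB's flexible document model using plain Python dicts
# (no database required). Documents in the same collection can have
# entirely different schemas.
docs = [
    # A fully structured record.
    {"_id": 1, "name": "Alice", "age": 34},
    # Same collection, different schema: a nested field, and no "age".
    {"_id": 2, "name": "Bob", "address": {"city": "Leeds", "postcode": "LS1"}},
    # Semi-structured: a raw log line stored alongside parsed fields.
    {"_id": 3, "source": "web", "raw": "GET /index.html 200"},
]

# A simple query: find the documents that have a "name" field.
named = [d for d in docs if "name" in d]
print([d["name"] for d in named])  # ['Alice', 'Bob']
```

The point is that no table definition has to change when a new field appears — each document carries its own structure.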

Big Data Crash Course: YARN

YARN (Yet Another Resource Negotiator) is the central part of our cluster management. It decouples resource management and scheduling from data processing – both of which used to be handled by MapReduce itself. The decoupled approach was taken to enable a far more varied toolset for data...

Big Data Crash Course: Hive

Apache Hive provides us with a familiar SQL-like query language to access and analyse the data stored in HDFS. Hive translates our SQL-like queries into MapReduce jobs – so even without programming experience, we can run jobs across many servers and benefit from data parallelisation. Within Hive, we...
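To give a feel for what that translation means conceptually, here is a toy sketch in pure Python — the sample data is invented, and real Hive compiles to far more sophisticated execution plans. A query like `SELECT word, COUNT(*) FROM words GROUP BY word` boils down to a map phase that emits key/value pairs and a reduce phase that aggregates them:

```python
# Conceptual sketch: how a HiveQL GROUP BY maps onto map/reduce phases.
#   SELECT word, COUNT(*) FROM words GROUP BY word;
from collections import defaultdict

rows = ["hive", "hdfs", "hive", "yarn", "hive"]  # assumed sample table

# Map phase: emit a (key, value) pair for every row.
mapped = [(word, 1) for word in rows]

# Shuffle + reduce phase: group by key and sum the values.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'hive': 3, 'hdfs': 1, 'yarn': 1}
```

This is why Hive lets non-programmers run distributed jobs: the user writes the declarative query, and the engine generates the map and reduce steps.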

Big Data Crash Course: Map Reduce

Map Reduce is a parallel computing framework that enables us to distribute computation across multiple data nodes. Let’s look at an example. We have a file that contains the text below. Line 1 Welcome to the Map Reduce section of our big data post. You’ll...
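The classic example is a word count. A minimal sketch of the pattern, with list partitions standing in for data nodes (the sample lines are assumed, based on the excerpt above):

```python
# Toy word count in the MapReduce style. Each "node" (here, a list
# partition) runs the map step independently; the results are then
# merged in the reduce step.
from collections import Counter
from functools import reduce

lines = [
    "Welcome to the Map Reduce section",
    "of our big data post",
    "Map Reduce counts words in parallel",
]

# Map phase: each node counts words in its own partition of the data.
partial_counts = [Counter(line.lower().split()) for line in lines]

# Reduce phase: merge the per-node counts into a single result.
total = reduce(lambda a, b: a + b, partial_counts, Counter())
print(total["map"], total["reduce"])  # 2 2
```

Because each partition is processed independently, the map phase scales out simply by adding nodes; only the final merge needs to see all the partial results.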

Big Data Crash Course: HDFS

HDFS stands for Hadoop Distributed File System. A distributed file system manages files and folders across multiple servers, providing resiliency and rapid data processing through parallelisation (as discussed in chapter one). So what does that mean? Let’s take a look. From the above, you...
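As a rough mental model, here is a heavily simplified sketch of the storage scheme: a file is split into fixed-size blocks, and each block is copied to several data nodes. Real HDFS defaults to 128 MB blocks and a replication factor of 3, and places replicas rack-aware; the tiny numbers and round-robin placement below are purely illustrative.

```python
# Simplified sketch of HDFS storage: split a file into fixed-size
# blocks, then place copies of each block on several data nodes.
BLOCK_SIZE = 10    # bytes per block here (real HDFS default: 128 MB)
REPLICATION = 3    # copies of each block (real HDFS default: 3)
NODES = ["node1", "node2", "node3", "node4"]

data = b"x" * 35   # a 35-byte "file" -> 4 blocks (3 full + 1 partial)

blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Round-robin replica placement (real HDFS is rack-aware; this is not).
placement = {
    idx: [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
    for idx in range(len(blocks))
}
print(len(blocks), placement[0])  # 4 ['node1', 'node2', 'node3']
```

Resiliency falls out of the replication (losing one node loses no data), and parallelisation falls out of the blocks: different nodes can process different blocks of the same file at the same time.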

Big Data Crash Course: Apache Ranger

Apache Ranger enables us to monitor and manage data security across our Hadoop cluster. We can define a security policy for users or groups of the cluster once and apply it across all supported components in the Hadoop stack. Ranger currently supports security policies for HDFS,...
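The idea of a define-once policy can be sketched as a single data structure evaluated against any access request. The field names below are invented for illustration and do not match Ranger's actual policy model:

```python
# Hedged sketch of the idea behind a Ranger-style policy: one
# definition (resource, groups, allowed accesses) checked for every
# request. Field names are illustrative only.
policy = {
    "resource": "/data/sales",
    "groups": {"analysts"},
    "accesses": {"read"},
}

def is_allowed(policy, group, resource, access):
    """Return True if the request matches the policy's grant."""
    return (
        resource.startswith(policy["resource"])
        and group in policy["groups"]
        and access in policy["accesses"]
    )

print(is_allowed(policy, "analysts", "/data/sales/2020.csv", "read"))   # True
print(is_allowed(policy, "analysts", "/data/sales/2020.csv", "write"))  # False
```

In Ranger itself, an equivalent single policy would be enforced consistently whether the data is reached through HDFS, Hive, or another supported component.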

Big Data Crash Course: Kerberos, Knox and Atlas

Kerberos

Kerberos authenticates users. It can be complex to configure – to make everything a lot simpler, we can carry out a simplified Kerberos setup, config and maintenance through Ambari. Kerberos is a network authentication protocol, designed to provide strong authentication for client/server applications by using secret-key...