Big Data Crash Course: YARN

YARN (Yet Another Resource Negotiator) is the central part of our cluster management. It decouples resource management and scheduling from data processing. In Hadoop 1, both were handled by MapReduce itself.

The decoupled approach was taken to enable a far more varied toolset for data processing. Rather than MapReduce jobs only, YARN can also manage interactive queries and workloads from Hive, Pig, HBase, Storm, Spark and plenty more tools, including applications written in Java or Scala.

When we say manage, what does it actually do? YARN allocates resources (CPU and memory) to a job. How it does this depends on the scheduler it is configured with: FIFO (First In First Out), Capacity or Fair.

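When we use the Capacity scheduler, the cluster is divided into queues, each with a guaranteed share of capacity. A small job submitted to its own queue can typically start as soon as it’s submitted, rather than waiting behind a bigger job in another queue. This is what lets us treat small jobs as ‘quick wins’.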

If we use the Fair scheduler, YARN balances resources between all running jobs. So, if you have only one job running, it can use 100% of resources. If you have 5 jobs of equal weight, they’ll receive roughly 20% each.
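To make the arithmetic concrete, here is a minimal Python sketch of an equal-weight fair split. It is only an illustration of the idea: the function name `fair_shares` and the job names are made up for this example, and the real Fair scheduler also accounts for queues, minimum shares and preemption, which are ignored here.

```python
def fair_shares(weights):
    """Return each running job's share of the cluster as a fraction of 1.0.

    `weights` maps a (hypothetical) job name to a positive weight.
    Equal weights give equal shares. Queues, minimum shares and
    preemption are deliberately left out of this sketch.
    """
    total = sum(weights.values())
    if total <= 0:
        return {}
    return {job: weight / total for job, weight in weights.items()}


# One running job gets the whole cluster.
one_job = fair_shares({"job-1": 1})

# Five running jobs of equal weight get one fifth each.
five_jobs = fair_shares({f"job-{i}": 1 for i in range(1, 6)})

print(one_job)
print(five_jobs)
```

Giving one job a weight of 2 in this sketch would give it twice the share of a job with weight 1.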

YARN also enables us to split resources between system and user workloads. This is useful when mission-critical jobs must complete by a set time: we can reserve resources to make sure they finish. User jobs are treated as ‘ad-hoc’, ‘non-industrialized’ reports and hence do not carry the same weight as the mission-critical ones.
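As a rough sketch of what such a split means in numbers, the Python below divides a cluster between two queues by percentage. The queue names (`system`, `user`), the 70/30 split and the cluster size are all assumptions chosen for illustration, and this shows the arithmetic only, not YARN's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcores: int
    memory_gb: int


def guaranteed(cluster, queue_percent):
    """Split `cluster` between queues according to whole-number percentages.

    The percentages must add up to 100, so the whole cluster is accounted for.
    """
    if sum(queue_percent.values()) != 100:
        raise ValueError("queue percentages must sum to 100")
    return {
        queue: Resources(
            vcores=cluster.vcores * pct // 100,
            memory_gb=cluster.memory_gb * pct // 100,
        )
        for queue, pct in queue_percent.items()
    }


# Hypothetical cluster: 100 vcores and 400 GB of memory.
cluster = Resources(vcores=100, memory_gb=400)

# Hypothetical split: 70% for mission-critical jobs, 30% for ad-hoc user jobs.
split = guaranteed(cluster, {"system": 70, "user": 30})

for queue, res in split.items():
    print(queue, res)
```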

Below, we have a view of the YARN UI. This shows us which jobs are currently running, along with their progress, the technology being used (e.g. Spark), the user that initiated the task and a few other useful pieces of information.
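The same information can also be read programmatically from the ResourceManager's REST API, which lists applications under `/ws/v1/cluster/apps`. The sketch below is a minimal example, not a full client: the host in the docstring is a placeholder, the helper names are made up, and `SAMPLE` is invented data shaped like the API's response so that the summary logic can run without a cluster.

```python
import json
from urllib.request import urlopen


def fetch_running_apps(rm_url):
    """Fetch running applications from a ResourceManager.

    `rm_url` is the base URL of your ResourceManager, for example
    "http://rm-host:8088" (a placeholder, not a real host).
    """
    with urlopen(f"{rm_url}/ws/v1/cluster/apps?states=RUNNING") as resp:
        return json.load(resp)


def summarize(payload):
    """Return (id, user, applicationType, progress) for each application."""
    # "apps" can be null when nothing matches, so fall back to an empty dict.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        (app["id"], app["user"], app["applicationType"], app["progress"])
        for app in apps
    ]


# Invented sample data, shaped like the REST response.
SAMPLE = {
    "apps": {
        "app": [
            {
                "id": "application_0000000000000_0001",
                "user": "alice",
                "name": "nightly-etl",
                "applicationType": "SPARK",
                "state": "RUNNING",
                "progress": 42.5,
            }
        ]
    }
}

rows = summarize(SAMPLE)
for app_id, user, app_type, progress in rows:
    print(f"{app_id}  {user}  {app_type}  {progress:.1f}%")
```

Against a live cluster, `summarize(fetch_running_apps(rm_url))` would produce the same kind of rows as the UI table.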