Understanding Hadoop. Yarn

Yet another article about Yet Another Resource Negotiator

Mykola-Bohdan Vynnytskyi
Aug 14, 2022

Intro

In the previous article, we learned about Hadoop and one of its components, HDFS.
Today we will deal with another part of this big yellow elephant: YARN.

YARN stands for “Yet Another Resource Negotiator”. The key idea of YARN is to split resource management and job scheduling/monitoring into separate daemons.

YARN provides APIs for requesting and working with cluster resources, but we don’t use these APIs directly. Instead, we write to higher-level APIs provided by distributed computing frameworks, which themselves are built on YARN and hide the resource management details from the user.

However, in this article, we will look at these hidden details and understand how they work together.

Architecture

YARN provides its main services through two types of long-running daemons:
a Resource Manager, which manages resource usage across the cluster, and Node Managers, which run on every node in the cluster to launch and monitor containers.
Let’s dive into the details!

Resource Manager

It is the master daemon of YARN, responsible for allocating and managing resources among all applications.

Whenever it receives a processing request, it allocates resources for the request and forwards it to an appropriate Node Manager.

It consists of two main components:

  • Scheduler
    Allocates resources to running applications based on their resource requirements and the resources available in the cluster.
    It is a pure scheduler, meaning that it does not monitor or track application status, and does not guarantee a restart if a task fails.
  • Applications Manager
    Responsible for accepting submitted applications and negotiating the first container, in which the application’s Application Master runs.
    It also restarts the Application Master container (we’ll talk about the Application Master shortly) if it fails.

A container is a constrained set of resources (memory, CPU, etc.) on a single node, in which tasks are executed.
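To make the idea of a pure scheduler and a container more concrete, here is a minimal sketch in Python. It is a toy model, not YARN's API: the names `Resource`, `Container`, and `ToyScheduler`, as well as the first-fit placement rule, are assumptions made for illustration. Real YARN schedulers also apply queue and capacity constraints.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resource:
    """A bundle of resources (hypothetical type for this sketch)."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: Resource) -> bool:
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: Resource) -> Resource:
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


@dataclass(frozen=True)
class Container:
    """A constrained slice of the resources of one node."""
    container_id: int
    node: str
    resource: Resource


class ToyScheduler:
    """Pure scheduler: it only hands out containers and never monitors or restarts tasks."""

    def __init__(self, free_by_node: dict[str, Resource]):
        self.free_by_node = dict(free_by_node)
        self._next_id = 1

    def allocate(self, request: Resource) -> Optional[Container]:
        # First-fit (an assumption of this sketch): use the first node with enough room.
        for node, free in self.free_by_node.items():
            if request.fits_in(free):
                self.free_by_node[node] = free.minus(request)
                container = Container(self._next_id, node, request)
                self._next_id += 1
                return container
        return None  # Nothing fits right now; the caller has to ask again later.


scheduler = ToyScheduler({
    "node-1": Resource(memory_mb=4096, vcores=2),
    "node-2": Resource(memory_mb=8192, vcores=4),
})
first = scheduler.allocate(Resource(memory_mb=3072, vcores=2))
second = scheduler.allocate(Resource(memory_mb=3072, vcores=2))
third = scheduler.allocate(Resource(memory_mb=8192, vcores=1))
print(first.node, second.node, third)  # node-1 node-2 None
```

The third request gets nothing because no single node has 8192 MB free any more, which shows that a container always lives on one node.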

Node Manager

This is the process responsible for all resources on a single node.
Its primary goal is to manage application containers assigned to it by the Resource Manager.

It registers with the Resource Manager and sends it periodic heartbeats with the node’s health status.

At the request of the Application Master, it starts a task in a container, monitors the task’s resource usage, and kills the container process when the task completes or fails.
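The Node Manager's duties described above can be sketched as a small Python class. This is an illustrative model only: `ToyNodeManager`, its method names, the state strings, and the rule that a task is killed when it uses more memory than its container allows are all assumptions of the sketch, not YARN's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class RunningContainer:
    container_id: int
    memory_limit_mb: int
    memory_used_mb: int = 0
    state: str = "RUNNING"  # toy states: RUNNING, COMPLETED, KILLED


@dataclass
class ToyNodeManager:
    node: str
    healthy: bool = True
    containers: dict = field(default_factory=dict)

    def launch(self, container_id: int, memory_limit_mb: int) -> None:
        # Started at the request of an Application Master.
        self.containers[container_id] = RunningContainer(container_id, memory_limit_mb)

    def report_usage(self, container_id: int, memory_used_mb: int) -> None:
        container = self.containers[container_id]
        container.memory_used_mb = memory_used_mb
        # Monitoring step (assumed rule): stop a task that outgrows its container.
        if container.state == "RUNNING" and memory_used_mb > container.memory_limit_mb:
            container.state = "KILLED"

    def complete(self, container_id: int) -> None:
        container = self.containers[container_id]
        if container.state == "RUNNING":
            container.state = "COMPLETED"

    def heartbeat(self) -> dict:
        # What this node would report to the Resource Manager.
        return {
            "node": self.node,
            "healthy": self.healthy,
            "containers": {cid: c.state for cid, c in self.containers.items()},
        }


nm = ToyNodeManager("node-1")
nm.launch(1, memory_limit_mb=1024)
nm.launch(2, memory_limit_mb=1024)
nm.report_usage(1, 512)
nm.report_usage(2, 2048)  # over the limit, so container 2 is killed
nm.complete(1)
print(nm.heartbeat())
```

Each heartbeat carries both the node's health flag and the state of its containers, which is how the Resource Manager in this model would learn that resources have been freed.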

Application Master

An application is a single job submitted to YARN. There is one Application Master for each job.
Its main task is to negotiate resources with the Resource Manager and to work with the Node Managers to execute and monitor the application’s tasks.
Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
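A rough sketch of what the Resource Manager could do with these heartbeats is shown below. The class `ToyLivenessTracker`, the integer clock, and the 10-second expiry window are hypothetical choices for this example. In a real cluster the expiry interval is a configuration setting.

```python
class ToyLivenessTracker:
    """Resource Manager side: remembers the last heartbeat of each Application Master."""

    def __init__(self, expiry_seconds: int):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}     # app_id -> time of the last heartbeat
        self.pending_asks = {}  # app_id -> number of containers still wanted

    def heartbeat(self, app_id: str, now: int, containers_wanted: int) -> None:
        # One heartbeat does two jobs: it proves liveness and refreshes the demand.
        self.last_seen[app_id] = now
        self.pending_asks[app_id] = containers_wanted

    def expired(self, now: int) -> list:
        # Application Masters that have been silent for longer than the window.
        return sorted(
            app_id
            for app_id, seen in self.last_seen.items()
            if now - seen > self.expiry_seconds
        )


tracker = ToyLivenessTracker(expiry_seconds=10)
tracker.heartbeat("app-1", now=0, containers_wanted=3)
tracker.heartbeat("app-2", now=0, containers_wanted=1)
tracker.heartbeat("app-1", now=8, containers_wanted=2)
print(tracker.expired(now=15))        # ['app-2']
print(tracker.pending_asks["app-1"])  # 2
```

Here `app-2` stopped sending heartbeats, so it is the one a Resource Manager would treat as failed and hand to the restart logic.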

In short, the Resource Manager is the master who has information about all resources and provides them on demand. It also keeps track of where these resources go and when it’s time to return them by communicating with the Node Managers.

A Node Manager, on the other hand, is a worker who owns certain resources on a certain node, rents out those resources to perform certain work, and communicates with the Resource Manager about what is happening on the node.

And the Application Master is the one who rents the resources to perform the necessary work, periodically reporting on its execution.

How do they work together?

YARN application workflow
  1. The user contacts the Resource Manager and asks it to run an Application Master process.
  2. The Resource Manager finds a Node Manager that has appropriate resources to launch the Application Master. The Application Master is launched in a container on that node.
  3. The Application Master, in turn, can simply run a computation in the container it is running in and return the result to the client or, as in our case, request more containers from the Resource Manager (if there are not enough resources).
  4. Once the Application Master has obtained the necessary resources, the job is processed in a distributed fashion.
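The four steps above can be walked through with a small simulation. This is only a sketch under simplifying assumptions: one container equals one free "slot", the Resource Manager picks the node with the most free slots, and the function name `run_application` and the task names are made up for the example.

```python
def run_application(free_slots: dict, tasks: list) -> list:
    """Walk through the four workflow steps and return a log of what happened.

    free_slots maps a node name to its number of free containers (toy model).
    """
    free = dict(free_slots)
    log = []

    def allocate() -> str:
        # Resource Manager side: pick the node with the most free slots.
        node = max(free, key=free.get)
        if free[node] == 0:
            raise RuntimeError("cluster is full")
        free[node] -= 1
        return node

    # Step 1: the client asks the Resource Manager to run an Application Master.
    log.append("client -> RM: submit application")
    # Step 2: the Resource Manager finds a node; the AM starts in a container there.
    am_node = allocate()
    log.append(f"RM -> {am_node}: launch Application Master")
    # Step 3: the AM requests one more container per task.
    placements = {task: allocate() for task in tasks}
    # Step 4: the tasks run in their containers, spread over the cluster.
    for task, node in placements.items():
        log.append(f"AM -> {node}: run {task}")
    return log


for line in run_application({"node-1": 2, "node-2": 2}, ["map-0", "map-1", "reduce-0"]):
    print(line)
```

With two nodes of two slots each, the Application Master lands on `node-1` and the three tasks are spread over both nodes. Asking for more tasks than there are free slots raises an error in this sketch, whereas a real Application Master would wait for resources to be released.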

Conclusion

In this article, we found out what YARN is, what its components are, and how they interact with each other.

Of course, this short article couldn’t contain everything, but I hope it helped you better understand what YARN is and how it works.

Check out my other articles if you are interested in Big Data technologies and everything related to them.

Thank you for reading.
See you!

Mykola-Bohdan Vynnytskyi

Big Data Software Engineer by day. YouTuber, author, and creator of courses by night. Passionate about Big Data and self-development.