Sunday, August 30, 2015

The Concept of Apache Hadoop MapReduce 2 or YARN (Yet Another Resource Negotiator)


Introduction


Compared to previous versions of Hadoop, where the NameNode and JobTracker daemons ran on the master node, MapReduce 2 or YARN was introduced to split the JobTracker's functionality into separate daemons for resource management and job scheduling.

Comparison



As the diagram above shows, in previous versions of Hadoop the cluster resource management layer was tightly coupled to, and a part of, the MapReduce layer. In Hadoop 2.0, a new layer, Yet Another Resource Negotiator (YARN), has been introduced between MapReduce and HDFS, and it is responsible for cluster resource management.

To understand why this change was needed, let us first recap how things worked in Hadoop 1.0 and what its downsides were.


As the figure above shows, in Hadoop 1.0 the JobTracker is part of the MapReduce framework and manages the MapReduce jobs/applications as well as cluster resource management. Every MapReduce job is divided into a number of map and reduce tasks, which run on the DataNodes (slaves), each of which has a fixed, predefined number of task slots. Because the JobTracker manages both the jobs and the cluster resources, it has to keep account of all these tasks, the predefined task slots on every machine, the status of running tasks, resource reporting and management for the cluster, re-running tasks in case any of them fail, and finally cleaning up temporary resources and releasing task slots once a task completes. Additionally, this tightly integrated JobTracker is one of the reasons why non-MapReduce applications, for example real-time jobs or ad-hoc queries, cannot run on the Hadoop nodes. Also, there can be only one JobTracker in a cluster, so you can imagine the load this single daemon has to deal with.
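To make the fixed-slot model concrete, here is a minimal sketch (not from the original post; the slot counts are purely illustrative) of the classic MRv1 properties a TaskTracker used to advertise its slots:

```java
import org.apache.hadoop.conf.Configuration;

public class Mrv1SlotConfigSketch {
    public static void main(String[] args) {
        // Hadoop 1.x (MRv1): every TaskTracker advertises a fixed number of map and
        // reduce slots, regardless of how large or small the individual tasks are.
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // map slots per node (illustrative)
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2); // reduce slots per node (illustrative)

        // The single JobTracker then tracks every job, slot and task across the
        // whole cluster, which is the bottleneck described above.
        System.out.println("Map slots per node: "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
    }
}
```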

To summarise, MapReduce 2.0 (YARN) overcomes the following issues in MapReduce 1:
  1. Scalability – Since the primary focus of YARN is scheduling, it can manage huge clusters far more efficiently, which greatly increases the amount of data the cluster can process.
  2. Compatibility with existing MapReduce-based applications – YARN can configure and run existing MapReduce applications without any hindrance or modification to existing processes (see the sketch after this list).
  3. Better cluster utilisation – The YARN ResourceManager optimises cluster utilisation according to the given criteria, such as capacity guarantees, fairness, and other service-level agreements.
  4. Support for additional workloads apart from MapReduce – Newer programming models such as graph processing and iterative modelling are now part of data processing. These models integrate easily with YARN, which helps organisations make sense of their real-time data and other market trends.
  5. Agility – YARN allows the resource management layer to operate and evolve in a more agile manner.
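To illustrate point 2, a standard MapReduce driver needs no code changes to run on YARN. A minimal sketch, assuming the usual Hadoop 2.x client libraries, is below; it simply uses the built-in identity Mapper and Reducer, and switching to YARN is a configuration change only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Running on YARN is a configuration change, not a code change;
        // the very same driver also runs under the classic MRv1 framework.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "pass-through");
        job.setJarByClass(PassThroughDriver.class);
        job.setMapperClass(Mapper.class);    // identity mapper
        job.setReducerClass(Reducer.class);  // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```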

Components of MapReduce 2.0 - YARN


As mentioned at the beginning of this blog, YARN splits the JobTracker's responsibilities into two: resource management and job scheduling. YARN achieves that using the components below:
  1. Global ResourceManager (RM): This is the root of YARN, and it governs the entire cluster. It has a built-in scheduler, which uses capacity scheduling by default, and it provides resources to all applications running in the cluster. The RM interacts with the NodeManagers (point 3 below) and the ApplicationMasters (point 2 below).
  2. ApplicationMaster per application (AM): The NodeManager launches the ApplicationMaster container as instructed by the RM. The ApplicationMaster negotiates appropriate resource containers from the RM, tracks their status, and monitors their progress (see the sketch after this list). The AM is the container that controls the entire application/task.
  3. NodeManager per slave node (NM): The NM launches the containers, monitors CPU, RAM, disk, network, etc. on its node, and reports back to the RM so that the RM can make a judgement call on how to serve new incoming requests.
  4. Container per application running on a NodeManager: In contrast to previous versions of Hadoop, where we allocated tasks to fixed slots, in MapReduce 2.0 each NM allocates the resources available on its node to applications in the form of containers.
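To make the RM/AM/NM/container interaction more tangible, here is a minimal sketch of the ApplicationMaster side of the flow using the public AMRMClient API from Hadoop 2.x. It is only an outline, not a complete application: it assumes it runs inside an AM container that YARN has already launched, and the memory/vcore numbers are purely illustrative.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleAppMasterSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // 1. The AM (itself running in a container that an NM launched on the
        //    RM's instruction) registers with the global ResourceManager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // 2. Instead of fixed map/reduce slots, the AM negotiates resource
        //    containers from the RM: here 3 containers of 1024 MB / 1 vcore.
        Priority priority = Priority.newInstance(0);
        Resource capability = Resource.newInstance(1024, 1);
        for (int i = 0; i < 3; i++) {
            rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        }

        // 3. The RM's scheduler grants containers based on what the NodeManagers
        //    have reported as available; allocate() also acts as the AM heartbeat.
        AllocateResponse response = rmClient.allocate(0.0f);
        List<Container> allocated = response.getAllocatedContainers();
        System.out.println("Containers granted so far: " + allocated.size());

        // 4. A real AM would now ask the NMs to launch work inside these
        //    containers, keep calling allocate(), and finally unregister.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
    }
}
```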
I hope this blog helps clear up the concept; I will add/update more as it comes.

Cheers.

