Understanding Hadoop. MapReduce

The only article you need to understand MapReduce

Mykola-Bohdan Vynnytskyi
5 min read · Sep 25, 2022
Photo by Bisakha Datta on Unsplash

Intro

In previous articles, we talked about two important components of Hadoop: HDFS and YARN.
Today we continue exploring Hadoop and look at another component that is less popular nowadays but still widely used: MapReduce.

MapReduce is the data processing layer of Hadoop. It lets us process huge amounts of data in parallel by dividing a job into a set of independent tasks.
The key idea is that we write some code (the business logic) for data processing in the form of Map and Reduce phases (more on those soon), and the engine takes care of everything else.
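To make this concrete, here is a minimal driver sketch for the classic word-count job, written against the standard org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are our own and are sketched later in this article; everything else (splitting the input, scheduling tasks, shuffling data) is handled by the engine.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Our business logic: the Map and Reduce phases (sketched later in the article)
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input files usually live in HDFS; the final output is written back to HDFS
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The engine takes care of splitting, scheduling, shuffling and fault tolerance
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}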

Data Flow

An example of a map-reduce job in a word count program

MapReduce works by breaking the data processing into two phases: Map and Reduce.
Map is the first phase of processing, where we specify all the complex logic, such as data transformations.
Reduce is the second phase, where we specify lightweight operations, such as aggregating the transformed data.
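For example, with a small made-up input of the two lines "hello world" and "hello hadoop", the map phase of a word-count job emits

(hello, 1), (world, 1)
(hello, 1), (hadoop, 1)

and the reduce phase groups these pairs by key and sums the values, producing

(hadoop, 1), (hello, 2), (world, 1)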

Sounds easy, doesn't it? Let's dive deeper and understand how everything works in detail.

InputFormat

The first steps in the map phase

A MapReduce job is created to extract useful information from existing data. This data, called the input files, usually lives in HDFS.
Before we can process the data, we have to read it, and to read it efficiently we have to read it in parallel.
This is precisely what InputFormat does. It defines how to split and read the input files.
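As a small sketch, this is roughly how the input format and input location are wired into the job with the Java API. TextInputFormat, which reads plain text line by line, is the default, and the path below is just a made-up example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {
  // Called from the driver; the path is a hypothetical HDFS directory
  static void configureInput(Job job) throws Exception {
    job.setInputFormatClass(TextInputFormat.class);             // defines how to split and read the input
    FileInputFormat.addInputPath(job, new Path("/data/input")); // where the input files live
  }
}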

InputSplit

InputFormat splits the data and creates a logical representation of it called InputSplits. Each split is divided into records, and each record will be processed by a mapper (more on that shortly).

By default, files are broken into 128 MB splits (the same size as HDFS blocks), but we can set the mapred.min.split.size parameter (mapreduce.input.fileinputformat.split.minsize in newer Hadoop versions) in mapred-site.xml to control this value.
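The same thing can also be configured per job in code. A rough sketch, where the 256 MB value is purely illustrative:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSetup {
  // Illustrative value only: ask for splits of at least 256 MB for this job
  static void configureSplits(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    // Equivalently, through the job configuration property:
    job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);
  }
}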

The important thing to understand is that an InputSplit does not contain the input data itself; it is just a reference to the data. That is why we call it a logical representation.

RecordReader

In addition to creating InputSplits, InputFormat creates a RecordReader, which is responsible for reading each record and converting it into key-value pairs suitable for the mapper.

It works through the InputSplit until the entire split has been read. With the default TextInputFormat, each line is assigned a unique number, its byte offset in the file, which serves as the key, while the line itself becomes the value. After that, the data is sent to the mapper.
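For example, a split containing the two made-up lines from earlier would be turned into the following pairs (the key is the byte offset at which each line starts):

hello world
hello hadoop

becomes

(0,  "hello world")
(12, "hello hadoop")    // "hello world" plus its newline is 12 bytes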

Mapper

Mapper generates new key-value pairs that will be used to count words

For each InputSplit, a mapper will be created.

A mapper is a processing unit that reads key-value records and generates new key-value pairs. These output pairs can be completely different from the input pairs.

The output data of all mappers is called the intermediate result, and it is stored on local disk rather than in HDFS.
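Sticking with the word-count example, a minimal mapper sketch might look like this (the class name WordCountMapper is our own choice):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input:  (byte offset, line of text), as produced by the RecordReader
// Output: (word, 1), a completely different kind of pair
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit (word, 1) for every word in the line
    }
  }
}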

Shuffling and Sorting

Data from all mappers are sorted and sent to the corresponding reducers

As already mentioned, a MapReduce job has two phases: map (which we discussed above) and reduce (which we will discuss later).
Each of them has its own purpose: while the mapper is busy with the main data transformation logic, the reducer deals with data aggregation.

As we already know, the mapper stores its output data on the local disk. The reducer's work depends on this intermediate data, so before the reducer can process it, the data must be delivered to it.
The process of transferring data from the mappers to the reducers is known as shuffling.
Shuffling can start before all mappers have finished their work, which saves time and lets the job finish sooner.

Before the reducer starts, all intermediate key-value pairs generated by the mappers are sorted by key; the values, on the other hand, can be in any order. Sorting helps the reducer easily tell when a new reduce call should start: it begins whenever the next key in the sorted input differs from the previous one.
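Continuing the word-count example, after shuffling and sorting the reducer side sees the intermediate pairs grouped like this (the order of the values within a group is arbitrary):

(hadoop, [1])
(hello,  [1, 1])
(world,  [1])

Each change of key marks the start of a new reduce call.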

Combiner

The Combiner combines the intermediate results before shuffling

Shuffling is an extremely expensive operation: it requires physically moving data over the network, which can be very time-consuming and very costly.
Unfortunately, it cannot always be avoided, but we can at least minimize its cost.

Between the map and reduce phases, we can apply an optional operation called the Combiner.
The Combiner, also known as a “Mini-Reducer”, is designed to minimize the amount of data that needs to be sent over the network by summarizing the mapper output by key.
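In the word-count case, the reduce logic (summing counts) is commutative and associative, so the reducer class itself can double as the combiner. A minimal wiring sketch, assuming the WordCountReducer shown in the next section:

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
  // Reuse the reduce logic as a "mini-reducer" on the map side.
  // This is safe here because summing counts is commutative and associative.
  static void configureCombiner(Job job) {
    job.setCombinerClass(WordCountReducer.class);
    // A mapper that emitted (hello, 1) three times now ships a single (hello, 3) over the network.
  }
}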

Reducer

We have already mentioned it here and there; let's now summarize what the reducer actually is.

The reducer processes the output data from the mappers and creates a new set of output data, which is stored in HDFS.
Because the intermediate data is partitioned by key and sorted, reducers can process it in parallel and independently of each other.
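Here is the matching word-count reducer sketch (again, the class name is our own choice):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input:  (word, [1, 1, ...]), all values for one key, grouped after the shuffle
// Output: (word, total count), written to HDFS by the job's OutputFormat
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();          // add up the 1s (or the partial sums from the combiner)
    }
    result.set(sum);
    context.write(key, result);    // e.g. (hello, 2)
  }
}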

By default, the number of reducers is 1, but we can configure this number ourselves.
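For example, in the driver the count can be set explicitly; the value 4 below is arbitrary:

// In the driver, before submitting the job:
job.setNumReduceTasks(4);   // run four reduce tasks instead of the default single one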

Conclusion

In this article, we familiarized ourselves with MapReduce and followed, in detail, the path that data takes to reach the form we need.
I hope this helped you understand the topic better.

Make sure you check out my other Big Data articles and don’t forget to subscribe 😉

Thank you for reading.
See you!

