
Hadoop Ecosystem

Components of Hadoop Ecosystem

This article explains the history of Apache Hadoop, why it is used, and what its components are. It is a simple yet complete overview of the framework.

History:

In the 1990s, the digital revolution became genuinely worldwide, and in the 2000s it reached people in developing countries. As people started saving data digitally instead of manually, Big Data was born. According to a report (2022), 2.5 quintillion bytes of data are created every day. Since data was now being kept digitally, there was an urgent need for a tool that could both store and handle Big Data. So, in 2005, Apache Hadoop was created by Doug Cutting and Mike Cafarella. It was named after Doug Cutting's son's toy elephant, and it was initially released on April 1, 2006.

Introduction:

Written in the Java programming language, Apache Hadoop is an open-source software platform for storing and analyzing huge volumes of data on clusters of commodity hardware. Hadoop is a top-level Apache project, developed and used by a global community of contributors and users, and distributed under the terms of the Apache License 2.0. It analyzes, processes, and manages big data.

Hadoop Architecture
Apache Hadoop is made up of four core components:

1. Hadoop Distributed File System (HDFS)

2. MapReduce Engine (MR)

3. Yet Another Resource Negotiator (YARN)

4. Hadoop Common

We'll go over them one by one.

Component #1. Hadoop Distributed File System (HDFS):

Storing data was one of the main reasons Apache Hadoop was created, and this is the component in charge of storing it.

HDFS

HDFS can store data of practically any size, from megabytes to petabytes, and in any format, structured or unstructured, without any hassle. It is built for high throughput: because reads and writes are spread across many machines, the aggregate data rate of a cluster can reach gigabytes per second.

HDFS enables Hadoop to work on clusters, i.e. multiple interconnected machines, and to handle thousands of terabytes of data.

HDFS has two types of nodes.

  • Data Nodes: (more than one)

The actual data to be processed is placed on the Data Nodes. They are also known as Slaves or workers.

  • Name Node: (usually one)

The Name Node holds the metadata about that data. It is also known as the Master.

It is to be remembered that the Name Node stores only metadata, that is, the list of blocks and their locations for any given file, and never the actual data. With this information, the Name Node determines how to reconstruct a file from its blocks. The Name Node and the Data Nodes are in constant communication through heartbeats and block reports. The Name Node is therefore a single point of failure. To mitigate this, a Secondary Name Node periodically checkpoints the Name Node's metadata so it can be recovered if the Name Node gets corrupted, and high-availability setups add a Standby Name Node that takes over if the active Name Node fails.
When a Data Node goes down, the availability of data is not affected thanks to the replication method: the Name Node detects the failure and re-creates the lost blocks from the replicated copies on other nodes.
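
To make the Master/Slave split concrete, here is a minimal sketch using the HDFS Java API. It asks the Name Node which Data Nodes hold the blocks of a file, without reading the data itself; the path and cluster details are illustrative, not taken from this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt");   // hypothetical HDFS path
    FileStatus status = fs.getFileStatus(file);

    // The block-to-DataNode mapping comes from the NameNode's metadata only.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}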

Now, how does HDFS actually work? Let's take a look at that too!

Instead of storing a file as one piece, HDFS splits it into multiple blocks and distributes those blocks across the Data Nodes. The entire notion of HDFS is that data is stored in a distributed manner. The default size of a block is 128 MB, which can be changed.

Now, if one Data Node crashes, do we lose the blocks of data it was holding?

The answer is NO, and that's the beauty of HDFS. HDFS makes copies of each block and stores them across multiple machines, with a default replication factor of 3. It means that the same piece of data is stored on multiple nodes; this is termed the replication method. As a result, data is not lost when a single node fails, making HDFS fault-tolerant.
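
As a small, hedged illustration of these settings, the sketch below reads a file's block size and replication factor and then raises its replication through the HDFS Java API; the path and the new value are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/example.txt");   // hypothetical HDFS path

    // Both values come from the NameNode's metadata for this file.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size (bytes): " + status.getBlockSize());
    System.out.println("Replication factor: " + status.getReplication());

    // Ask the NameNode to keep 5 copies of this file instead of the default 3;
    // the extra replicas are created asynchronously in the background.
    fs.setReplication(file, (short) 5);
    fs.close();
  }
}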

After the data has been correctly saved, it must now be processed. So, here comes MapReduce, Apache Hadoop's second component.

Component #2. MapReduce (MR):

MapReduce is a generic concept based on the divide-and-conquer approach, an idea that long predates Hadoop. It is not restricted to Apache Hadoop either; MongoDB and Splunk also use MapReduce-style processing.

MapReduce was first developed at Google, where it had been used internally for some time before Jeffrey Dean and Sanjay Ghemawat published the MapReduce paper in 2004.

It is designed specifically for large amounts of data. MapReduce is not a language; it is a framework, an engine, and a way of programming. MapReduce jobs can be written in other languages as well (for example, via Hadoop Streaming), but the MapReduce engine that ships with Apache Hadoop is implemented in Java.

It divides data into sections, processes each section individually on distinct Data Nodes, and then aggregates the individual outputs to produce the final output. This improves load balancing while also saving a significant amount of time.

This is done in the following order:

  • The data is taken as input.
  • It is split into different blocks.
  • It is passed through the Mapper phase.
  • Then it is shuffled and sorted.
  • It is passed to the Reducer phase.

Now, a Mapper function runs on every block, and all the Mappers run in parallel. They all forward their output to the Reducer (by default there is a single Reducer, though the number can be increased), which aggregates the answers from all the Mappers and gives the final result. Mappers run independently of one another, with few or no data dependencies.
If we want to define the Mapper and Reducer functions, it would be done as follows:

  • The Mapper turns a collection of data into a new set of data in which individual elements are broken down into tuples (key-value pairs).
  • The Reducer takes the Mappers' output as input and merges those data tuples into a smaller collection of tuples (key-value pairs).
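
To make this concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names and paths are illustrative, not taken from this article; treat it as a sketch rather than a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: turns each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);        // emit (word, 1)
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) after shuffle/sort and sums the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);        // emit (word, total)
    }
  }

  // Driver: wires the Mapper and Reducer into a Job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}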

Now, our Hadoop cluster is ready. As explained earlier, multiple jobs run on Hadoop simultaneously and each of them needs some resources to complete the task successfully. To efficiently manage these resources (RAM allocation, network bandwidth, etc.), we have the third module of Apache Hadoop which is YARN.

Component #3. Yet Another Resource Negotiator (YARN):

YARN is in charge of allocating resources to several MapReduce tasks that are operating at the same time. 

It consists of:

  • Resource Manager
  • Node Manager
  • Application Master
  • Containers

Let's take a look at each one separately.

  • The task of assigning resources falls to the Resource Manager.
  • The Node Manager is in charge of the nodes and keeps track of their resource utilization.
  • Containers are bundles of physical resources on a single node, such as RAM, CPU cores, and disk, in which tasks actually run.
  • The Application Master is the process that manages an application's execution on the cluster. Each application has its own Application Master, which is responsible for negotiating resources (Containers) from the Resource Manager and collaborating with the Node Managers to perform and monitor tasks.

Suppose we want to run the MR job we created. The client submits it to the Resource Manager, which allocates a container on one of the nodes and launches the job's Application Master in it. The Application Master then negotiates additional containers from the Resource Manager, and the Node Managers launch and monitor the map and reduce tasks inside those containers. This is how YARN processes job requests and manages cluster resources.
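
As a hedged sketch of how a job can tell YARN how big its containers should be, the snippet below sets a few commonly used Hadoop 2.x configuration properties before submitting; the values are illustrative, and the property names should be checked against your Hadoop version's documentation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ContainerSizingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Memory (in MB) each map/reduce task container requests from YARN.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);

    // Memory for the Application Master's own container.
    conf.setInt("yarn.app.mapreduce.am.resource.mb", 1536);

    Job job = Job.getInstance(conf, "resource-aware job");
    // Set Mapper, Reducer, and input/output paths as in the word-count sketch,
    // then submit with job.waitForCompletion(true).
  }
}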

Component #4. Hadoop Common:

Hadoop Common, also known as Hadoop Core, is a set of utilities and libraries that support the other Hadoop modules. Like the rest of Hadoop, Hadoop Common assumes that hardware failures are common and that the Hadoop framework should handle them automatically in software.

The Hadoop Common package is the framework's foundation/core, as it offers important services and operations including abstraction of the underlying operating system and file system. Hadoop Common also includes the Java Archive (JAR) files and scripts needed to get Hadoop up and running. The Hadoop Common package also includes source code and documentation, as well as a contribution area with several Hadoop Community projects.

Advantages:

1. It is used in data warehousing, fraud detection, recommendation systems, and so on.
2. It is an open-source framework.
3. It accepts a variety of data.
4. It is cost-effective.
5. Hadoop's distributed processing and distributed storage design handles massive volumes of data rapidly. In 2008, a Hadoop cluster even set the record in the terabyte sort benchmark.
6. It can withstand faults.
7. The majority of upcoming Big Data technologies are Hadoop-compatible.
8. It is highly scalable.
9. It is very convenient to use.

Disadvantages:

1. Hadoop is ideal for a limited number of huge files, but it struggles with applications that deal with a large number of small files.
2. It is written in Java, which many developers find verbose and difficult. Moreover, Java has been a frequent target of attackers, which leaves Hadoop more exposed to security breaches.
3. Hadoop supports only batch processing.
4. Because Hadoop does not perform in-memory computation, it incurs extra disk I/O and processing overhead.
5. Hadoop cannot do iterative processing on its own.
6. Hadoop employs Kerberos authentication for security, which is difficult to administer.

Conclusion:

Hadoop has proved to be a game-changer for businesses, from startups to giants like Facebook, IBM, eBay, and Amazon. The wider Hadoop ecosystem comprises Apache Hive, Pig, Spark, Flume, Sqoop, and more, and they all work together on the management of big data. For setting up Hadoop in a single-node cluster, you can follow this blog (Apache Hadoop Setup).


See you next time,

@TechAE
