How To Install Multi-Node Hadoop On Ubuntu 20.04
Introduction:
As the world turns digital, a gargantuan amount of data is produced every second, giving birth to Big Data. Hadoop is an open-source Apache Big Data framework, implemented in Java, used for storing, analyzing, and processing vast amounts of data.
It was created in 2005 by Doug Cutting, the current Chief Architect of Cloudera, to solve the two basic problems of Big Data: storage and processing. Its design rests on the concept of distributed storage and distributed processing in a parallel fashion.
Apache Hadoop comprises three components: HDFS, MapReduce, and YARN.
The storage unit, the Hadoop Distributed File System (HDFS), replicates data and stores it across multiple systems; this replication is what gives HDFS its fault tolerance.
MapReduce (MR), the processing unit, is the core engine for ETL (Extract, Transform, and Load) work. MR processes each part of the Big Data separately and in parallel, then combines the results at the end, saving a considerable amount of time.
Multiple jobs run on Hadoop simultaneously, and each needs its share of resources to complete successfully. Managing these resources adroitly is the job of the third component, YARN, short for Yet Another Resource Negotiator, which consists of the Resource Manager, Node Managers, Application Masters, and Containers.
This article covers the installation of multi-node Apache Hadoop. We are going to proceed with two PCs, one as hadoop-master and the other as hadoop-slave.
PREREQUISITES:
- Linux
- Java
Step 1: Install OpenJDK on Ubuntu
$ sudo apt update
$ sudo apt install openjdk-8-jdk -y # Apache Hadoop 3.x fully supports Java 8
Once the installation process is complete, verify the current Java version:
$ java -version
$ javac -version
Step 2: Create Hadoop User
$ sudo adduser hdoop
The username, in this example, is hdoop. You are free to use any username and password you see fit.
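Several of the later steps run sudo while logged in as this user, so (assuming your setup needs it) you may also want to grant the new user sudo privileges from your original administrative account before switching:
$ sudo usermod -aG sudo hdoop # Adds hdoop to the sudo group on Ubuntu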
$ su - hdoop
Switch to the newly created user and enter the corresponding password.
Step 3: Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the binary download version of Hadoop you want to implement.
The steps outlined in this tutorial use the Binary download for Hadoop Version 3.3.1.
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
# Downloads hadoop-3.3.1 using wget
$ tar xzf hadoop-3.3.1.tar.gz # Extracts the tar to the hadoop-3.3.1 directory.
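The rest of this guide assumes the extracted hadoop-3.3.1 directory lives in the hdoop user's home directory, /home/hdoop/hadoop-3.3.1, which is the path used for HADOOP_HOME below. If you downloaded and extracted it elsewhere, move it there first:
$ mv hadoop-3.3.1 /home/hdoop/ # Matches the HADOOP_HOME path set in .bashrc in Step 5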
Step 4: IP-Configuration
Before configuring the Hadoop files, you have to set up the IP address mapping between the two devices. They must be on the same LAN/Wi-Fi network to be able to communicate with each other.
Open your hosts file using the below command:
$ sudo nano /etc/hosts # For opening hosts file
Enter the IPs of your hadoop-master and hadoop-slave PCs like this:
192.168.0.xx2 hadoop-master
192.168.0.xx1 hadoop-slave
You have to do this step on all connected nodes before moving on to the next step.
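To confirm that the hostname mapping works, you can ping each node by name from the other (a quick optional check):
$ ping -c 3 hadoop-slave # Run on hadoop-master; should show replies from the slave's IP
$ ping -c 3 hadoop-master # Run on hadoop-slave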
Create an SSH key on the master node, then copy it to the master itself and to all slave nodes so the hdoop user can log in to every node without a password:
$ ssh-keygen -t rsa
$ ssh-copy-id hdoop@hadoop-master
$ ssh-copy-id hdoop@hadoop-slave
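Before continuing, you can verify that passwordless login works (this check assumes the hdoop user exists on both machines):
$ ssh hdoop@hadoop-slave # Should log in without prompting for a password
$ exit
If the connection is refused, make sure the OpenSSH server is installed and running on the target node (for example, sudo apt install openssh-server).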
Step 5: Hadoop Deployment on Multi-Node
A Hadoop environment is configured by editing a set of configuration files on all nodes (all but .bashrc live under $HADOOP_HOME/etc/hadoop):
- bashrc
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
Configure Hadoop Environment Variables (bashrc)
$ nano ~/.bashrc # For opening the .bashrc file (it belongs to hdoop, so sudo is not needed)
Define the Hadoop environment variables by adding the following content to the end of the file:
export HADOOP_HOME=/home/hdoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Now, save and exit the .bashrc file.
It is vital to apply the changes to the current running environment by using the following command:
$ source ~/.bashrc
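To check that the variables took effect, print HADOOP_HOME or ask Hadoop for its version:
$ echo $HADOOP_HOME # Should print /home/hdoop/hadoop-3.3.1
$ hadoop version # Should report Hadoop 3.3.1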
Edit hadoop-env.sh File
The hadoop-env.sh file holds the Java settings Hadoop uses. If you are unsure where OpenJDK is installed, find it first:
$ readlink -f /usr/bin/javac # Prints the path to javac inside your OpenJDK directory
Strip the trailing /bin/javac from the output to get the JDK directory, then open the file and set JAVA_HOME accordingly:
$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh # For opening the hadoop-env.sh file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Edit core-site.xml File
$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml #For opening core-site.xml file
Add the following configuration:
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hdoop/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and edit log file. Configure the file by defining the NameNode and DataNode storage directories.
$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml # For opening hdfs-site.xml file
Add the following configuration:
<configuration>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
</configuration>
Keep dfs.replication no higher than the number of DataNodes in the cluster; with one master and one slave, 2 is the maximum that can actually be satisfied.
Edit mapred-site.xml File
$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml # For opening mapred-site.xml file
Add the following configuration:
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>
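Depending on your environment, MapReduce jobs on Hadoop 3.x may also need to be told where the MapReduce libraries live. One commonly used optional addition, placed inside the same <configuration> block, is sketched below; the value assumes the HADOOP_HOME path used in this guide:
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.3.1</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.3.1</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/home/hdoop/hadoop-3.3.1</value>
</property>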
Edit yarn-site.xml File
$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml # For opening yarn-site.xml file
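Add a minimal configuration along the following lines, assuming hadoop-master runs the YARN ResourceManager (adjust to your setup):
<configuration>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop-master</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
</configuration>
Finally, for start-all.sh on the master to also start the DataNode and NodeManager daemons on the slave, Hadoop 3.x reads the worker hostnames from $HADOOP_HOME/etc/hadoop/workers. Listing both nodes there, one hostname per line, fits this two-node setup:
hadoop-master
hadoop-slave
Remember that the edits in this step must also be present on the slave node; you can repeat them there or copy the edited files over with scp.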
Step 6: Format HDFS NameNode
It is important to format the NameNode before starting the Hadoop services for the first time. Run this on hadoop-master only:
$ hdfs namenode -format
Step 7: Start Hadoop Cluster
Navigate to the hadoop-3.3.1/sbin directory in your hadoop-master and execute the following command:
$ ./start-all.sh
This will start the NameNode, DataNodes, SecondaryNameNode, and the YARN ResourceManager and NodeManagers across your hadoop-master and slave nodes. To check that all the daemons are active and running as Java processes, run the following on each node:
$ jps
Your Hadoop cluster is now up and running and ready to use. To stop the cluster, you can use the command below:
$ ./stop-all.sh # To stop Hadoop Services
Access Hadoop UI from Browser
http://hadoop-master:9870 # Hadoop NameNode UI
http://hadoop-slave:9864 # Hadoop DataNodes UI
http://hadoop-master:8088 # YARN Resource Manager
Conclusion
You have successfully installed Hadoop on Ubuntu and deployed it in multi-node mode. I will be writing a new blog post explaining in detail how to run MapReduce jobs on this cluster.