Easy Install Single-Node Hadoop on Ubuntu 20.04
Introduction:
As the world turns digital, a gargantuan amount of data is produced every second, giving rise to Big Data. Hadoop is an open-source Apache framework, implemented in Java, for storing, processing, and analyzing vast amounts of data.
It was created in 2005 by Doug Cutting, now Chief Architect of Cloudera, to solve the two basic problems of Big Data: storage and processing. Its design rests on the concepts of distributed storage and distributed, parallel processing.
Apache Hadoop comprises three components: HDFS, MapReduce, and YARN.
The storage layer, the Hadoop Distributed File System (HDFS), replicates data and stores the copies across multiple machines; this replication is what gives HDFS its fault tolerance.
MapReduce (MR), the processing layer, is the core engine for ETL work (Extract, Transform, Load). MR processes each part of the data independently and then combines the results at the end, which saves a considerable amount of time.
Because multiple jobs run on Hadoop simultaneously and each needs its own share of resources, a third component manages them: YARN, short for Yet Another Resource Negotiator, which consists of the Resource Manager, Node Managers, Application Masters, and Containers.
This article covers the installation of a single-node Apache Hadoop cluster, along with the extra configuration needed for a multi-node setup.
PREREQUISITES:
- Linux
- Java
Step 1: Install OpenJDK on Ubuntu
$ sudo apt update
$ sudo apt install openjdk-8-jdk -y # Apache Hadoop 3.x fully supports Java 8
Once the installation process is complete, verify the current Java version:
$ java -version # Verify the Java runtime
$ javac -version # Verify the Java compiler
Step 2: Install OpenSSH on Ubuntu
$ sudo apt install openssh-server openssh-client -y
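Optionally, check that the SSH service is running (on Ubuntu the service is named ssh):
$ sudo systemctl status ssh # Should report 'active (running)'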
Step 3: Create Hadoop User
$ sudo adduser hdoop
The username, in this example, is hdoop. You are free to use any username and password you see fit.
$ su - hdoop
Switch to the newly created user and enter the corresponding password. When running sudo commands as the new user, you may see the error "hdoop is not in the sudoers file. This incident will be reported." This happens because the new user has not been granted sudo permissions. To add them, do the following:
$ su - {initial_user}
$ sudo -i # Switch to root
root> usermod -a -G sudo hdoop # Add hdoop to the sudo group
root> exit # Problem solved
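Optionally, confirm that the group change took effect (hdoop may need to log out and back in before the new membership applies):
$ groups hdoop # 'sudo' should now appear in the list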
Step 4: Enable Passwordless SSH for Hadoop User
For a multi-node cluster, you can skip this step, as it does not rely on localhost.
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa # Generates SSH key pair
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys # Giving required permissions
$ ssh localhost
Step 5: Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page and select the binary download for the Hadoop version you want to install.
The steps outlined in this tutorial use the Binary download for Hadoop Version 3.3.1.
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
# Downloads hadoop-3.3.1 using wget
$ tar xzf hadoop-3.3.1.tar.gz # Extracts the tar to the hadoop-3.3.1 directory.
Setup IP-Configuration (Multi-Node only)
Before configuring the Hadoop files, you have to set up the IP address mapping between the two devices; they must be on the same LAN/Wi-Fi network to be able to communicate with each other.
Open your hosts file using the below command:
$ sudo nano /etc/hosts # For opening hosts file
Enter the IP addresses of your hadoop-master and hadoop-slave machines, for example:
192.168.0.xx2 hadoop-master
192.168.0.xx1 hadoop-slave
Do this step on all your connected nodes, then move on to the next step.
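As an optional sanity check, verify that the hostnames resolve and the nodes can reach each other (hostnames as defined in /etc/hosts above):
$ ping -c 3 hadoop-slave # Run on hadoop-master
$ ping -c 3 hadoop-master # Run on hadoop-slave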
Step 6: Hadoop Deployment on Single Node
A Hadoop environment is configured by editing a set of configuration files:
- bashrc
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
Configure Hadoop Environment Variables (bashrc)
$ nano ~/.bashrc # For opening the hdoop user's .bashrc file
Define the Hadoop environment variables by adding the following content to the end of the file:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Now, save and exit the .bashrc file.
It is vital to apply the changes to the current running environment by using the following command:
$ source ~/.bashrc
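As a quick check (optional), confirm that the variables are picked up; the hadoop command comes from the bin directory added to PATH above:
$ echo $HADOOP_HOME # Should print /home/hdoop/hadoop-3.3.1
$ hadoop version # Prints the Hadoop version if PATH is set correctly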
Edit hadoop-env.sh File
First, find the path to your OpenJDK installation:
$ readlink -f /usr/bin/javac # Prints the path to the javac binary inside your OpenJDK directory
Strip the trailing /bin/javac from the output to get the JAVA_HOME path. Then open hadoop-env.sh and set (or uncomment) the JAVA_HOME line:
$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh # For opening hadoop-env.sh file
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
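To confirm the change was saved (optional):
$ grep "^export JAVA_HOME" $HADOOP_HOME/etc/hadoop/hadoop-env.sh # Should print the line you just set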
Edit core-site.xml File
$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml # For opening core-site.xml file
Add the following configuration:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
For a multi-node setup, replace 127.0.0.1 with the master's hostname, e.g. hdfs://hadoop-master:9000.
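Note that hadoop.tmp.dir must point to a directory the hdoop user can write to; it does not hurt to create it up front (path as configured above):
$ mkdir -p /home/hdoop/tmpdata # Directory referenced by hadoop.tmp.dir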
Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and edit log file. Configure the file by defining the NameNode and DataNode storage directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node setup.
$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml # For opening hdfs-site.xml file
Add the following configuration:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
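Likewise, you can create the NameNode and DataNode directories ahead of time so they exist with the right ownership (paths as configured above):
$ mkdir -p /home/hdoop/dfsdata/namenode /home/hdoop/dfsdata/datanode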
Edit mapred-site.xml File
$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml # For opening mapred-site.xml file
Add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
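That is all this walkthrough needs. If MapReduce jobs later fail with classpath errors on Hadoop 3.x, the Apache single-node setup guide also adds the following property inside the same <configuration> block; treat it as an optional extra rather than part of the original post:
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>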
Edit yarn-site.xml File
$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml # For opening yarn-site.xml file
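The original post does not list the yarn-site.xml properties. A minimal single-node configuration, following the property names documented in the Apache Hadoop setup guide, looks like this:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>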
Step 7: Format HDFS NameNode
Format the NameNode before starting Hadoop services for the first time (this is only needed once):
$ hdfs namenode -format
Step 8: Start Hadoop Cluster
Navigate to the hadoop-3.3.1/sbin directory and execute the following command:
$ ./start-all.sh
This starts the NameNode, DataNode, Secondary NameNode, YARN ResourceManager, and NodeManager. To check that all the daemons are active and running as Java processes:
$ jps
Note: If the DataNode does not appear in the jps output, delete the dfsdata and tmpdata folders in your home directory, reformat the NameNode (Step 7), and then repeat Step 8.
Now, your Hadoop is up and running and ready to use. To stop the Hadoop cluster, you can use the below command:
$ ./stop-all.sh # To stop Hadoop Services
Access Hadoop UI from Browser
http://localhost:9870 # Hadoop NameNode UI
http://localhost:9864 # Hadoop DataNodes UI
http://localhost:8088 # YARN Resource Manager
Conclusion
You have successfully installed Hadoop on Ubuntu and deployed it in single-node mode. Running MapReduce jobs will be covered in detail in a follow-up post.