

How To Set Up Apache Spark On A Hadoop Cluster

Spark is a framework that uses Resilient Distributed Datasets (RDDs) to process massive volumes of data, including data referenced from external distributed storage systems. Machine learning applications, data analytics, and graph-parallel processing all employ Spark in distributed computing.
We discussed Spark in detail in the previous blog post; this tutorial walks you through installing and testing Apache Spark on Ubuntu 20.04.

Step 1: Download and install the Apache Spark binaries

Spark binaries are available from https://spark.apache.org/downloads.html. Download the latest version of Spark from the website given above.

Log in to node-master as the Hadoop user (hdoop) and run the following commands:

$ wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
$ tar -xvf spark-3.2.1-bin-hadoop3.2.tgz
$ mv spark-3.2.1-bin-hadoop3.2 spark
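
As an optional sanity check, you can confirm that the binaries unpacked correctly by printing the Spark version (the path below assumes the spark directory was created under /home/hdoop, the Hadoop user's home, as above):

$ /home/hdoop/spark/bin/spark-submit --version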

Step 2: Spark Master Configuration

2.1 Integrate Spark with YARN

Edit the bashrc file /home/hdoop/.bashrc and add the following lines:

export SPARK_HOME=/home/hdoop/spark
export PATH=/home/hdoop/spark/bin:$PATH
export LD_LIBRARY_PATH=/home/hdoop/hadoop/lib/native:$LD_LIBRARY_PATH

Then source the ~/.bashrc file to apply the changes:

$ source ~/.bashrc

Restart your session by logging out and logging in again.
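
To verify that the new environment variables are in effect, you can check them from the shell; which spark-shell should resolve to the bin directory that was just added to PATH:

$ echo $SPARK_HOME
$ which spark-shell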

Now, create a copy of the default Spark configuration template:

$ cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:

spark.master    yarn

Spark is now ready to interact with your YARN cluster.
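
To confirm the YARN integration (optional), you can submit the SparkPi example that ships with the Spark distribution. HDFS and YARN must already be running; the exact jar name under $SPARK_HOME/examples/jars depends on your Spark and Scala versions, so a wildcard is used here:

$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 10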

2.2 Edit spark-env.sh

Move to the Spark conf folder and create a copy of the spark-env.sh template:

$ cd /home/hdoop/spark/conf
$ cp spark-env.sh.template spark-env.sh

Now edit the configuration file spark-env.sh.

$ sudo nano spark-env.sh

And set the following parameters. For a multi-node Hadoop cluster, use the IP address of the master node; for a single-node cluster, you can set SPARK_MASTER_HOST to 'localhost' instead.

export SPARK_MASTER_HOST='<MASTER-IP>'
export JAVA_HOME=<Path_of_JAVA_installation>
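
For reference, a filled-in spark-env.sh might look like the lines below; the IP address and Java path are only assumptions for illustration and must be replaced with your own values.

# Example values only -- replace with your own master IP and Java installation path
export SPARK_MASTER_HOST='192.168.1.10'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64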

2.3 Add Workers

Edit the workers configuration file in /home/hdoop/spark/conf (this file was named slaves in older Spark releases).

$ sudo nano workers

If you have a multi-node cluster, add the hostnames of all the nodes that should run a worker:

master
slave01
slave02

Otherwise, for a single-node cluster, you just have to add:

localhost
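
On a multi-node cluster, the Spark directory and its configuration must also exist at the same path on every worker. If you have not copied it over yet, you can distribute it from node-master, assuming passwordless SSH is already configured and using the worker hostnames listed above:

$ scp -r /home/hdoop/spark slave01:/home/hdoop/
$ scp -r /home/hdoop/spark slave02:/home/hdoop/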

2.4 Start Spark Cluster

To start the Spark cluster, run the following commands on the master node.

$ cd /home/hdoop/spark
$ ./sbin/start-all.sh
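
Alternatively, the daemons can be started individually with the scripts bundled in sbin (in Spark 3.x the worker script is named start-workers.sh; older releases call it start-slaves.sh):

$ ./sbin/start-master.sh
$ ./sbin/start-workers.sh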

Step 3: Check whether services have been started

To check the running daemons on the master and worker nodes, use the following command.

$ jps
Spark Nodes
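
On the master node, the jps output should include the Spark Master daemon (and a Worker daemon if the master also runs a worker) alongside the Hadoop daemons; the exact list depends on your Hadoop setup, for example:

$ jps
<pid> Master
<pid> Worker
<pid> NameNode
<pid> ResourceManager
<pid> Jps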

Step 4: Spark Web UI

Browse the Spark master web UI to see the worker nodes, running applications, and cluster resources. By default it listens on port 8080 of the master node:
http://localhost:8080/ (or http://<MASTER-IP>:8080/ from another machine)

Spark Master

Step 5: Start spark-sql

To run SQL queries on the Spark cluster, run the following commands on any node.

$ cd /home/hdoop/spark
$ ./bin/spark-sql
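
As a quick sanity check, you can run a couple of simple statements at the spark-sql prompt; on a fresh install, SHOW DATABASES should list at least the built-in default database. Exit with quit; (or Ctrl+D).

spark-sql> SHOW DATABASES;
spark-sql> SELECT 1 + 1;
spark-sql> quit;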

Step 6: Stop Spark Cluster

To stop the Spark cluster, run the following commands on the master node.

$ cd /home/hdoop/spark
$ ./sbin/stop-all.sh
