Run MapReduce Job On Hadoop
Want to run a MapReduce job on a Hadoop cluster? Here is a simple tutorial that walks you through every step.
What is MapReduce?
Hadoop MapReduce is a framework for writing applications that process massive volumes of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner.
A MapReduce job typically splits the input data set into independent chunks that are processed in parallel by the map tasks. The framework sorts the map outputs, which are then fed into the reduce tasks. Typically, both the job's input and output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.
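For example, in a word-count job over the two input lines "Hello world" and "Hello Hadoop", the flow looks roughly like this (a simplified sketch; the actual grouping depends on how the input is split and how many reducers run):
Map output: (Hello, 1) (world, 1) (Hello, 1) (Hadoop, 1)
After sort and shuffle: (Hadoop, [1]) (Hello, [1, 1]) (world, [1])
Reduce output: (Hadoop, 1) (Hello, 2) (world, 1)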
PREREQUISITES:
Apache Hadoop must be configured and running. See:
- Single-Node Setup for first-time users
- Multi-Node Cluster Setup for larger clusters
Table of Contents
- WordCount Program
- Updating hadoop-env.sh
- Compilation of WordCount Program
- Creating Input Files
- Executing Application
- Results
Step 1: WordCount Program
WordCount is a simple application that counts the number of occurrences of each word in a given input data set. Save the code below as WordCount.java in your $HADOOP_HOME directory.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums up the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Step 2: Updating hadoop-env.sh
Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh and make sure it contains the following lines:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Make sure you have set the correct Java path, or you will get the following error:
Error: Could not find or load main class com.sun.tools.javac.Main
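If you are not sure what the correct Java path is on your system, one common way to check on Linux (assuming javac is on your PATH) is:
$ readlink -f $(which javac)
/usr/lib/jvm/java-8-openjdk-amd64/bin/javac
The path printed on your machine may differ; JAVA_HOME is that path with the trailing /bin/javac removed.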
Step 3: Compilation of WordCount Program
These commands will compile WordCount.java and create a jar file:
$ cd $HADOOP_HOME
$ bin/hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
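If the com.sun.tools.javac.Main helper gives you trouble, an equivalent alternative (a sketch, assuming a JDK's javac is installed) is to invoke javac directly with the classpath that Hadoop reports:
$ cd $HADOOP_HOME
$ javac -classpath "$(bin/hadoop classpath)" WordCount.java
$ jar cf wc.jar WordCount*.class
Either way, compilation should produce WordCount.class, WordCount$TokenizerMapper.class, and WordCount$IntSumReducer.class, which the jar command packs into wc.jar.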
Step 4: Creating Input Files
These commands create input_dir in HDFS, create two small text files, and upload them into it. Run the commands from the $HADOOP_HOME directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p input_dir
$ nano file01
Hello world, Good Morning Neophytes.
$ nano file02
Hello Hadoop, Good morning to Hadoop.
$ $HADOOP_HOME/bin/hadoop fs -put -p /home/hdoop/hadoop-3.3.1/file01 input_dir
$ $HADOOP_HOME/bin/hadoop fs -put -p /home/hdoop/hadoop-3.3.1/file02 input_dir
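Before launching the job, you can verify that both files made it into HDFS; the -cat output should echo back exactly what you typed into the files:
$ $HADOOP_HOME/bin/hadoop fs -ls input_dir
$ $HADOOP_HOME/bin/hadoop fs -cat input_dir/file01
Hello world, Good Morning Neophytes.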
Step 5: Executing Application
$ bin/hadoop jar wc.jar WordCount input_dir output_dir
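Note that the job will refuse to start if output_dir already exists, so if you want to re-run it, delete the old output directory first:
$ bin/hadoop fs -rm -r output_dir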
Wait for a little while, and you will see output similar to this:
2022-07-09 11:48:50,199 INFO mapreduce.Job: Running job: job_1657347551206_0001
2022-07-09 11:49:22,878 INFO mapreduce.Job: Job job_1657347551206_0001 running in uber mode : false
2022-07-09 11:49:22,880 INFO mapreduce.Job: map 0% reduce 0%
2022-07-09 11:49:58,616 INFO mapreduce.Job: map 100% reduce 0%
2022-07-09 11:50:13,897 INFO mapreduce.Job: map 100% reduce 100%
2022-07-09 11:50:14,933 INFO mapreduce.Job: Job job_1657347551206_0001 completed successfully
2022-07-09 11:50:15,270 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=147
FILE: Number of bytes written=821951
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=303
HDFS: Number of bytes written=82
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=62984
Total time spent by all reduces in occupied slots (ms)=12503
Total time spent by all map tasks (ms)=62984
Total time spent by all reduce tasks (ms)=12503
Total vcore-milliseconds taken by all map tasks=62984
Total vcore-milliseconds taken by all reduce tasks=12503
Total megabyte-milliseconds taken by all map tasks=64495616
Total megabyte-milliseconds taken by all reduce tasks=12803072
Map-Reduce Framework
Map input records=2
Map output records=11
Map output bytes=119
Map output materialized bytes=153
Input split bytes=228
Combine input records=11
Combine output records=11
Reduce input groups=9
Reduce shuffle bytes=153
Reduce input records=11
Reduce output records=9
Spilled Records=22
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1087
CPU time spent (ms)=4160
Physical memory (bytes) snapshot=644632576
Virtual memory (bytes) snapshot=7456948224
Total committed heap usage (bytes)=581115904
Peak Map Physical memory (bytes)=251269120
Peak Map Virtual memory (bytes)=2483400704
Peak Reduce Physical memory (bytes)=144678912
Peak Reduce Virtual memory (bytes)=2490146816
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=75
File Output Format Counters
Bytes Written=82
Step 6: Results
You can check the result by running this command:
$ bin/hadoop fs -cat output_dir/part-r-00000
Good 2
Hadoop, 1
Hadoop. 1
Hello 2
Morning 1
Neophytes. 1
morning 1
to 1
world, 1
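If you would rather have the result as a local file instead of reading it from HDFS, copy it out (the local file name below is just an example):
$ bin/hadoop fs -get output_dir/part-r-00000 ./wordcount_result.txt
If the job had used several reducers there would be multiple part-r-* files; bin/hadoop fs -getmerge output_dir result.txt concatenates them into a single local file.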
Let’s Put It All Together:
We started out by defining MapReduce and explaining how it processes large data sets in parallel across a Hadoop cluster.
Then we walked through the steps to compile, package, and run a MapReduce job on your Hadoop cluster. Finally, we checked the results, and you now know how to run your own custom MapReduce job. Feel free to ask questions if you need any help.
Good Luck!