
Hadoop MapReduce Java

I'm trying to learn Hadoop, and the documentation contains the example below. I can't understand what these parameters mean. Please help me understand the map and reduce methods. I have read books about Hadoop MapReduce and understand the theory, but I don't understand the code.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
  // consumes (byte offset, line of text), emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the line into whitespace-separated tokens.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // Emit one (word, 1) pair per token.
        context.write(word, one);
      }
    }
  }

  // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
  // consumes (word, [1, 1, ...]), emits (word, total count).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all the 1s emitted for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner: partial sums are computed
    // on the map side to cut down shuffle traffic.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

As for the parameters:

Mapper<Object, Text, Text, IntWritable>

These are the data types of, in order: the input key, the input value, the output key, and the output value. So the input key type is Object, the input value type is Text, the output key type is Text, and the output value type is IntWritable.

The same for Reducer.
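To make the correspondence concrete, here is a minimal sketch (the class names MyMapper and MyReducer are hypothetical, made up for this illustration; the Hadoop types are real):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The type parameters are always <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
// For word count: byte offset in, line of text in, word out, count out.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// The Reducer's KEYIN/VALUEIN must equal the Mapper's KEYOUT/VALUEOUT,
// because the map output becomes the reduce input.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }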

But I believe you are more interested in the map and reduce methods.

public void map(Object key, Text value, Context context)

Here, key and value are simply instances of the "KEYIN" and "VALUEIN" types above.

The Context object allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as the interfaces that allow it to emit output.

Applications can use the Context:

  • to report progress
  • to set application-level status messages
  • to update Counters
  • to indicate they are alive
  • to get values stored in the job configuration, available across the map/reduce phases
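As an illustration of a few of these uses, here is a minimal sketch of a Mapper that touches the Context in each of those ways (the counter group/name and the configuration key are made up for the example, not standard Hadoop names):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class ContextDemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Read a value from the job configuration (the key name is hypothetical).
    String separator = context.getConfiguration().get("wordcount.separator", " ");

    // Update a custom counter (group and name are hypothetical).
    context.getCounter("WordCount", "LinesSeen").increment(1);

    // Set an application-level status message and report liveness.
    context.setStatus("Processing offset " + key.get());
    context.progress();

    // Emit output, exactly as in the word count example.
    for (String token : value.toString().split(separator)) {
      context.write(new Text(token), new IntWritable(1));
    }
  }
}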

So, how is the map method invoked?

To explain it at the simplest level: a Mapper instance is created for each input split (with small files, roughly one per file in your input folder), and the map method is called once for each line of the file (that is what the default TextInputFormat produces as a record).
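Under the hood, the framework drives this through the Mapper's run method. The loop below is paraphrased from org.apache.hadoop.mapreduce.Mapper, so treat it as a sketch of the framework's behavior, not code you need to write:

public void run(Context context) throws IOException, InterruptedException {
  // Called once per Mapper instance, before any map() calls.
  setup(context);
  try {
    // One map() call per input record; with TextInputFormat,
    // one record per line of the file.
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    // Called once at the end, even if map() throws.
    cleanup(context);
  }
}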

Consider this file:

Hi I love Hadoop.
I code in Java.

The map method will be called twice:

  1. Key: 0, Value: Hi I love Hadoop.
  2. Key: 18, Value: I code in Java.

The key is the byte offset at which the line starts in the file; the second key is 18 because the first line is 17 bytes plus a one-byte newline.


The output of the map method becomes the input to the reduce method, after the framework has grouped and sorted it by key during the shuffle phase.

context.write(word, one);

This line sends a key/value pair to the reducer. That is why the Mapper's output key and value types must match the Reducer's input key and value types.
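To make the whole flow concrete, here is a sketch of what moves between the phases for the two-line sample file above (assuming the default lexicographic sort order of Text keys):

Map output, one pair per token:

  (Hi, 1) (I, 1) (love, 1) (Hadoop., 1) (I, 1) (code, 1) (in, 1) (Java., 1)

Reduce input, after the shuffle groups the pairs by key:

  (Hadoop., [1]) (Hi, [1]) (I, [1, 1]) (Java., [1]) (code, [1]) (in, [1]) (love, [1])

Reduce output, written to the output directory:

  Hadoop.  1
  Hi       1
  I        2
  Java.    1
  code     1
  in       1
  love     1

Since main() also registers IntSumReducer as a combiner (job.setCombinerClass), some of these pairs may be pre-summed on the map side, but the final result is the same.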

Hope this helps. Let me know if you have any further questions. Good luck.
