Running two mappers and two reducers for a simple Hadoop MapReduce job

I want to get a better understanding of how to use multiple mappers and reducers, and I'd like to try it out with a simple Hadoop MapReduce word count job. I want to run two mappers and two reducers for this job. Is there something I need to configure manually in the configuration files, or is it enough to make changes in the WordCount.java file?

I'm running this job on a single node, and I launch it as:

$ hadoop jar job.jar input output

And I've started:

$ hadoop namenode -format
$ hadoop namenode

$ hadoop datanode

sbin$ ./yarn-daemon.sh start resourcemanager

I'm running hadoop-2.0.0-cdh4.0.0.

And my WordCount.java file is:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
  private static final Log LOG = LogFactory.getLog(WordCount.class);

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        // emit (token, 1) for every whitespace-separated token in the line
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      //printKeyAndValues(key, values);

      // add up every count emitted for this word
      for (IntWritable val : values) {
        sum += val.get();
        LOG.info("val = " + val.get());
      }
      LOG.info("sum = " + sum + " key = " + key);
      result.set(sum);
      context.write(key, result);
      //System.err.println(String.format("[reduce] word: (%s), count: (%d)", key, result.get()));
    }

    // a little method to print debug output
    private void printKeyAndValues(Text key, Iterable<IntWritable> values) {
      StringBuilder sb = new StringBuilder();
      for (IntWritable val : values) {
        sb.append(val.get() + ", ");
      }
      System.err.println(String.format("[reduce] key: (%s), value: (%s)", key, sb.toString()));
    }
  }

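  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    // wire up the mapper and reducer defined above
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}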

Could anyone help me run two mappers and two reducers for this word count job?

Gladnick: If you plan to use the default TextInputFormat, there will be at least as many mappers as there are input files (or more, depending on the file size). So just put 2 files into your input directory and you will get 2 mappers running. (I'm suggesting this because you plan to run it as a test case.)
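To see why two small files give two mappers, here is a rough plain-Java model of how FileInputFormat sizes its splits. It is a sketch and not the Hadoop source: the class and method names are made up for illustration, the 128 MB block size is an assumption, and it only models non-empty, splittable files.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of split counting; names here are illustrative, not Hadoop API.
class SplitCountSketch {
    // The last split is allowed to run up to 10% over the target size.
    static final double SPLIT_SLOP = 1.1;

    // Split size rule: the block size clamped between the min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one file; assumes fileLength > 0.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    // One map task per split, summed over all input files.
    static int countMappers(List<Long> fileLengths, long blockSize,
                            long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        int total = 0;
        for (long length : fileLengths) {
            total += countSplits(length, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size
        // Two small files: one split each.
        System.out.println(countMappers(Arrays.asList(1000L, 2000L),
                                        blockSize, 1L, Long.MAX_VALUE));
        // One 1000-byte file with the max split size capped at 500 bytes.
        System.out.println(countMappers(Arrays.asList(1000L),
                                        blockSize, 1L, 500L));
    }
}
```

Both calls in main print 2: the first because each file smaller than a block becomes its own split, the second because capping the split size at half the file length cuts a single file in two.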

Since you asked for 2 reducers, all you need to do is call job.setNumReduceTasks(2) in your main before submitting the job.
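With two reduce tasks and no custom partitioner, each word is routed by hash. The sketch below reproduces the arithmetic of Hadoop's default HashPartitioner in plain Java. It uses String keys for illustration, so the actual reducer a word lands on in a real job (where the key is a Text with its own hashCode) may differ; the class and method names are hypothetical.

```java
// Plain-Java illustration of hash partitioning; not a Hadoop class.
class HashPartitionSketch {
    // Same formula as the default partitioner: clear the sign bit, then take the modulo.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "map", "reduce", "word", "count"};
        for (String w : words) {
            // With 2 reduce tasks every word maps to reducer 0 or reducer 1.
            System.out.println(w + " -> reducer " + partitionFor(w, 2));
        }
    }
}
```

Every occurrence of the same word hashes to the same reducer, which is what lets each reducer produce a complete count for the words it receives.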

After that, just build a jar of your application and run it on your Hadoop pseudo-distributed cluster.

If you need to control which word goes to which reducer, you can specify that in a Partitioner class.
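As an illustration of the kind of rule a partitioner can encode, here is a plain-Java sketch that sends words starting with a vowel to reducer 0 and everything else to reducer 1. The class name and the rule itself are assumptions made for this example. In a real job the same logic would live in a subclass of org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable>, in its getPartition(key, value, numPartitions) method, and be registered with job.setPartitionerClass(...).

```java
// Plain-Java stand-in for the decision a custom Partitioner would make.
class VowelConsonantPartitionSketch {
    // Returns the index of the reducer that should receive this word.
    static int getPartition(String word, int numReduceTasks) {
        // With fewer than two reducers (or an empty key) everything goes to reducer 0.
        if (word.isEmpty() || numReduceTasks < 2) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        boolean startsWithVowel = "aeiou".indexOf(first) >= 0;
        return startsWithVowel ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println("apple  -> reducer " + getPartition("apple", 2));
        System.out.println("banana -> reducer " + getPartition("banana", 2));
    }
}
```

The guard for fewer than two reducers matters because a partitioner must never return an index greater than or equal to the number of reduce tasks.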

        Configuration configuration = new Configuration();
        // create a configuration object that provides access to various
        // configuration parameters
        Job job = new Job(configuration, "Wordcount-Vowels & Consonants");
        // create the job object and set job name as Wordcount-Vowels &
        // Consonants
        job.setJarByClass(WordCount.class);
        // set the main class
        job.setNumReduceTasks(2);
        // set the number of reduce tasks required
        job.setMapperClass(TokenizerMapper.class);
        // set the map class for the job
        job.setCombinerClass(IntSumReducer.class);
        // set the combiner class for the job
        job.setPartitionerClass(MyPartitioner.class);
        // set the partitioner class for the job
        // (MyPartitioner is a placeholder name for your own Partitioner)
        job.setReducerClass(IntSumReducer.class);
        // set the reduce class for the job
        job.setOutputKeyClass(Text.class);
        // set the output type of key (the word) expected from the job, Text
        // analogous to String
        job.setOutputValueClass(IntWritable.class);
        // set the output type of value (the count) expected from the job,
        // IntWritable analogous to int
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // set the input directory for fetching the input files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

This should be the structure of your main program. You may include the combiner and the partitioner if needed.

For mappers, set


to half the size of your file.

For reducers, set them to 2 explicitly as


