Mapreduce Combiner

I have a simple MapReduce job with a mapper, a reducer, and a combiner. The output from the mapper is passed to the combiner. But the reducer receives the output of the mapper instead of the output of the combiner.

Kindly help.

Code:

package Combiner;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class AverageSalary
{
public static class Map extends  Mapper<LongWritable, Text, Text, DoubleWritable> 
{
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException 
    {    
        String[] empDetails= value.toString().split(",");
        Text unit_key = new Text(empDetails[1]);      
        DoubleWritable salary_value = new DoubleWritable(Double.parseDouble(empDetails[2]));
        context.write(unit_key,salary_value);    

    }  
}
public static class Combiner extends Reducer<Text,DoubleWritable, Text,Text> 
{
    public void reduce(final Text key, final Iterable<DoubleWritable> values, final Context context)
    {
        String val;
        double sum=0;
        int len=0;
        while (values.iterator().hasNext())
        {
            sum+=values.iterator().next().get();
            len++;
        }
        val=String.valueOf(sum)+":"+String.valueOf(len);
        try {
            context.write(key,new Text(val));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
public static class Reduce extends Reducer<Text,Text, Text,Text> 
{
    public void reduce (final Text key, final Text values, final Context context)
    {
        //String[] sumDetails=values.toString().split(":");
        //double average;
        //average=Double.parseDouble(sumDetails[0]);
        try {
            context.write(key,values);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
public static void main(String args[])
{
    Configuration conf = new Configuration();
    try
    {
     String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();    
     if (otherArgs.length != 2) {      
         System.err.println("Usage: Main <in> <out>");      
         System.exit(-1);    }    
     Job job = new Job(conf, "Average salary");    
     //job.setInputFormatClass(KeyValueTextInputFormat.class);    
     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));    
     job.setJarByClass(AverageSalary.class);    
     job.setMapperClass(Map.class);    
     job.setCombinerClass(Combiner.class);
     job.setReducerClass(Reduce.class);    
     job.setOutputKeyClass(Text.class);    
     job.setOutputValueClass(Text.class);    

        System.exit(job.waitForCompletion(true) ? 0 : -1);
    } catch (ClassNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

}

The #1 rule of Combiners is: do not assume that the combiner will run. Treat the combiner only as an optimization.

The Combiner is not guaranteed to run over all of your data. In some cases, when the data doesn't need to be spilled to disk, MapReduce will skip using the Combiner entirely. Note also that the Combiner may be run multiple times over subsets of the data! It'll run once per spill.

In your case, you are making this bad assumption. You should be doing the sum in the Combiner AND the Reducer.
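To make that concrete, here is a minimal sketch (not from the original post; class and method names are illustrative) of a type-consistent aggregation for a plain per-key sum. Because summing is associative, the same class can serve as both Combiner and Reducer, and the result is correct whether the combiner runs zero, one, or many times. It uses the same org.apache.hadoop imports as the code above.

public static class SumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable v : values) {  // partial sums from the map side
            sum += v.get();                // merge into one running total
        }
        context.write(key, new DoubleWritable(sum));
    }
}

// In the driver, the same class is registered twice:
// job.setCombinerClass(SumReducer.class);
// job.setReducerClass(SumReducer.class);

For an average (as in the question) a plain sum is not enough, because the count of records is lost once the combiner collapses them; the sketch after the next answer shows one way to carry the count along.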

Also, you should follow @user987339's answer as well. The input and output of the combiner need to be identical (Text, DoubleWritable -> Text, DoubleWritable), and they need to match up with the output of the Mapper and the input of the Reducer.

It seems that you forgot about an important property of a combiner:

the input types for the key/value and the output types of the key/value need to be the same. 键/值的输入类型和键/值的输出类型必须相同。

You can't take in a Text/DoubleWritable pair and return a Text/Text pair. I suggest you use Text instead of DoubleWritable, and do the parsing inside the Combiner.
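Putting both answers together, here is a minimal sketch of one way to restructure the job, assuming the same comma-separated input as the question (department in field 1, salary in field 2). Class names are illustrative, and it uses the same org.apache.hadoop imports as the code above. The value is encoded as a "sum:count" Text pair so the Combiner's input and output types are identical and it can safely run zero, one, or many times.

// Mapper: emit a partial "sum:count" pair per record (one salary, count 1).
public static class AvgMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] empDetails = value.toString().split(",");
        context.write(new Text(empDetails[1]), new Text(empDetails[2] + ":1"));
    }
}

// Combiner: merge partial pairs; (Text, Text) in and (Text, Text) out.
public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(":");
            sum += Double.parseDouble(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new Text(sum + ":" + count));
    }
}

// Reducer: repeat the same merge, then divide once at the very end.
public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(":");
            sum += Double.parseDouble(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new DoubleWritable(sum / count));
    }
}

Because the map output value type (Text) now differs from the final output value type (DoubleWritable), the driver also needs job.setMapOutputKeyClass(Text.class), job.setMapOutputValueClass(Text.class), and job.setOutputValueClass(DoubleWritable.class).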

If a combine function is used, it has the same form as the reduce function (and is an implementation of Reducer), except that its output types are the intermediate key and value types (K2 and V2), so that they can feed the reduce function:

map: (K1, V1) -> list(K2, V2)
combine: (K2, list(V2)) -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)

Often the combine and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.

The Combiner will not always run when you execute a MapReduce job.

If there are at least three spill files (mapper output written to the local disk), the combiner will execute so that the size of the files can be reduced and they can be transferred to the reduce node more easily.

The number of spills for which a combiner needs to run can be set through the min.num.spills.for.combine property.
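For example, a sketch only: min.num.spills.for.combine is the old-style property name quoted above, so verify the exact key for your Hadoop version before relying on it.

Configuration conf = new Configuration();
// Run the combiner during the merge only once at least three
// spill files exist (three is also the usual default).
conf.setInt("min.num.spills.for.combine", 3);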
