
Strange behavior on Hadoop's reducer

I have a simple class called Pair that implements org.apache.hadoop.io.Writable. It contains two fields and is used as a value in the MapReduce process.

For each key, I want to find the pair with the largest value of one of Pair's fields (preco). In the reducer, the following code produces the expected result:

float max = 0;
String country = "";
for (Pair p : values){
    if (p.getPreco().get() > max)
    {
        max = p.getPreco().get();
        country = p.getPais().toString();
    }
}
context.write(key, new Pair(new FloatWritable(max), new Text(country)));

The following code, on the other hand, does not:

Pair max = new Pair();
for (Pair p : values)
    if (p.getPreco().get() > max.getPreco().get())
        max = p;

context.write(key, max);

The second version produces, for each key, the last value associated with it in the input file rather than the highest value.

Is there a reason for this apparently strange behavior?

You have this problem because the reducer reuses objects, so its iterator over the values always passes you the same object. Thus this code:

max = p;

will always refer to the current value of p. Once max points at that shared object, every later comparison is between the object and itself, so max ends up holding whatever was read last. You need to copy the data into max, rather than keep a reference to the object, for this to work properly. This is why the first version of your code works.
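The effect can be reproduced without Hadoop. The following is a minimal sketch, not Hadoop code: `ReuseDemo`, `Box` and `reusing` are hypothetical names, and the iterator deliberately hands out one shared object to imitate the reuse described above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical mutable holder standing in for a Writable value.
    public static final class Box {
        public float price;
        public String country = "";
    }

    // Returns an Iterable whose iterator yields the SAME Box every time,
    // overwriting its fields on each call to next().
    public static Iterable<Box> reusing(List<Float> prices, List<String> countries) {
        Box shared = new Box();
        return () -> new Iterator<Box>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < prices.size(); }
            @Override public Box next() {
                shared.price = prices.get(i);
                shared.country = countries.get(i);
                i++;
                return shared;
            }
        };
    }

    // Keeps a reference: after the first assignment, max and b are one object.
    public static String maxByReference(Iterable<Box> values) {
        Box max = new Box();
        for (Box b : values) {
            if (b.price > max.price) {
                max = b;
            }
        }
        return max.country;
    }

    // Copies the fields: max stays a separate object.
    public static String maxByCopy(Iterable<Box> values) {
        Box max = new Box();
        for (Box b : values) {
            if (b.price > max.price) {
                max.price = b.price;
                max.country = b.country;
            }
        }
        return max.country;
    }

    public static void main(String[] args) {
        List<Float> prices = Arrays.asList(3f, 9f, 5f);
        List<String> countries = Arrays.asList("BR", "PT", "AO");
        System.out.println(maxByReference(reusing(prices, countries))); // prints "AO" (the last value)
        System.out.println(maxByCopy(reusing(prices, countries)));      // prints "PT" (the largest value)
    }
}
```

With the reference version the result is the last element, which matches the behavior reported in the question; with the copy version it is the largest.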

Usually in Hadoop I would implement a .set() method on a custom writable; this is a common pattern you will see. So your Pair class might look a bit like this (it's missing the interface methods etc.):

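public class Pair implements Writable {

    public FloatWritable max = new FloatWritable();
    public Text country = new Text();

    // write(DataOutput) and readFields(DataInput) omitted here.

    // Copy the other Pair's data instead of keeping a reference to it.
    public void set(Pair p) {
        this.max.set(p.max.get());
        this.country.set(p.country);
    }
}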

And you would change your code to: 然后您将代码更改为:

Pair max = new Pair();
for (Pair p : values) {
    if (p.max.get() > max.max.get()) {
        max.set(p);
    }
}
context.write(key, max);

I haven't created getters in Pair, so the code is changed slightly to access the public fields directly.
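For completeness, here is a rough sketch of what the omitted interface methods could look like, together with the set() pattern. It is written under several assumptions so that it runs without Hadoop on the classpath: the nested `Writable` interface is a local stand-in declaring the same two methods as org.apache.hadoop.io.Writable, the fields are a plain float and String rather than FloatWritable and Text, and `PairSketch`, `serialize` and `largest` are hypothetical helpers. The `largest` method imitates a framework that deserializes every record into one reused instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PairSketch {
    // Local stand-in with the same two methods as org.apache.hadoop.io.Writable.
    public interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Simplified Pair: plain fields instead of FloatWritable/Text (an assumption).
    public static final class Pair implements Writable {
        public float max;
        public String country = "";

        public Pair() {}
        public Pair(float max, String country) { this.max = max; this.country = country; }

        // Copy the other Pair's data into this one.
        public void set(Pair p) { this.max = p.max; this.country = p.country; }

        @Override public void write(DataOutput out) throws IOException {
            out.writeFloat(max);
            out.writeUTF(country);
        }

        // Overwrites this instance's fields in place.
        @Override public void readFields(DataInput in) throws IOException {
            max = in.readFloat();
            country = in.readUTF();
        }
    }

    // Serializes the given pairs back to back into one byte array.
    public static byte[] serialize(Pair... pairs) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (Pair p : pairs) {
                p.write(out);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads `count` records into ONE reused instance and copies the largest via set().
    public static Pair largest(byte[] data, int count) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Pair reused = new Pair();
            Pair max = new Pair();
            for (int i = 0; i < count; i++) {
                reused.readFields(in);
                if (reused.max > max.max) {
                    max.set(reused);
                }
            }
            return max;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because readFields overwrites the reused instance on every record, only the copy held in max survives the loop, which is the reason for calling set() instead of assigning the reference.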
