Part of key changes when iterating through values when using composite key - Hadoop

I have implemented secondary sort on Hadoop, and I don't really understand the framework's behavior.

I have created a composite key that contains the original key and the part of the value that is used for sorting.

To achieve this I have implemented my own partitioner:

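import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<CoupleAsKey, LongWritable> {

    @Override
    public int getPartition(CoupleAsKey couple, LongWritable value, int numPartitions) {
        // Partition on key1 only; mask the sign bit so the index is never negative
        return (Long.hashCode(couple.getKey1()) & Integer.MAX_VALUE) % numPartitions;
    }
}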
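A side note on the hash-modulo partitioner above: `Long.hashCode` can return a negative `int`, and Java's `%` keeps the sign of the dividend, so the partition index can come out negative for some keys. The following is a minimal plain-Java sketch (no Hadoop dependency) comparing the naive index with the sign-masked form. The class name `PartitionIndexDemo` and the sample key are made up for illustration.

```java
class PartitionIndexDemo {

    /** Naive index: may be negative because % keeps the sign of the hash. */
    static int naive(long key1, int numPartitions) {
        return Long.hashCode(key1) % numPartitions;
    }

    /** Masked index: clearing the sign bit keeps the result in [0, numPartitions). */
    static int masked(long key1, int numPartitions) {
        return (Long.hashCode(key1) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        long key1 = 2147483648L; // hypothetical key whose hash is Integer.MIN_VALUE
        System.out.println("naive:  " + naive(key1, 3));
        System.out.println("masked: " + masked(key1, 3));
    }
}
```

If all your `key1` values are small and non-negative, the naive form never hits this case, but the masked form costs nothing and stays within the valid range for any key.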

My own grouping comparator:

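public class GroupComparator extends WritableComparator {

    protected GroupComparator() {
        // true: let WritableComparator create CoupleAsKey instances to deserialize into
        super(CoupleAsKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CoupleAsKey c1 = (CoupleAsKey) w1;
        CoupleAsKey c2 = (CoupleAsKey) w2;

        // Group on key1 only, so records that differ only in key2 share one reduce call
        return Long.compare(c1.getKey1(), c2.getKey1());
    }
}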

And I defined the composite key in the following way:

public class CoupleAsKey implements WritableComparable<CoupleAsKey>{

private long key1;
private long key2;

public CoupleAsKey() {
}

public CoupleAsKey(long key1, long key2) {
    this.key1 = key1;
    this.key2 = key2;
}

public long getKey1() {
    return key1;
}

public void setKey1(long key1) {
    this.key1 = key1;
}

public long getKey2() {
    return key2;
}

public void setKey2(long key2) {
    this.key2 = key2;
}

@Override
public void write(DataOutput output) throws IOException {

    output.writeLong(key1);
    output.writeLong(key2);

}

@Override
public void readFields(DataInput input) throws IOException {

    key1 = input.readLong();
    key2 = input.readLong();
}

@Override
public int compareTo(CoupleAsKey o2) {

    int cmp = Long.compare(key1, o2.getKey1());

    if(cmp != 0)
        return cmp;

    return Long.compare(key2, o2.getKey2());
}

@Override
public String toString() {
    return key1 + ","  + key2 + ",";
}

}

And here is the driver:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf); // replaces the deprecated new Job(conf)

job.setJarByClass(SSDriver.class);

job.setMapperClass(SSMapper.class);
job.setReducerClass(SSReducer.class);

job.setMapOutputKeyClass(CoupleAsKey.class);
job.setMapOutputValueClass(LongWritable.class);
job.setPartitionerClass(CustomPartitioner.class);
job.setGroupingComparatorClass(GroupComparator.class);

FileInputFormat.addInputPath(job, new Path("/home/marko/WORK/Whirlpool/input.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/marko/WORK/Whirlpool/output"));

job.waitForCompletion(true);

Now, this works, but what is really strange is that while iterating over the values for a key in the reducer, the second part of the key (the value part) changes on each iteration. Why, and how?

@Override
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {

    for (LongWritable value : values) {
        // key.key2 changes during iterations, why?
        context.write(key, value);
    }
}

The definition says that "if you want all your relevant rows within a partition of data sent to a single reducer you must implement a grouping comparator". This only ensures that that set of keys is sent to a single reduce call. It does not mean that the key is changed from a composite key (or whatever it is) into something that contains only the part of the key on which the grouping was done.

However, as you iterate over the values, the key changes along with them: each value comes paired with the key it was emitted with. We normally do not notice this, because by default values are grouped on the whole (non-composite) key, so even when the value changes, the value of the key remains the same.
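To make this concrete, here is a minimal plain-Java sketch of the behavior. It is a simplified model, not Hadoop's actual code: it assumes the framework hands `reduce` one mutable key object and refreshes that object each time the value iterator advances. The names `KeyReuseDemo`, `Couple` and `valuesOfGroup` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Simplified model of a reduce-side value iterator; NOT Hadoop's actual code. */
class KeyReuseDemo {

    /** Mutable composite key, mirroring the two long fields of CoupleAsKey. */
    static final class Couple {
        long key1;
        long key2;

        void set(long k1, long k2) { key1 = k1; key2 = k2; }

        @Override
        public String toString() { return key1 + "," + key2; }
    }

    /** Iterates the values of one group, overwriting the shared key on every step. */
    static Iterable<Long> valuesOfGroup(long[][] sortedRecords, Couple sharedKey) {
        return () -> new Iterator<Long>() {
            private int i = 0;

            @Override
            public boolean hasNext() { return i < sortedRecords.length; }

            @Override
            public Long next() {
                if (!hasNext()) throw new NoSuchElementException();
                long[] r = sortedRecords[i++];
                sharedKey.set(r[0], r[1]); // the "current key" now reflects this record
                return r[2];
            }
        };
    }

    /** Plays the role of reduce(): records what the key looks like for each value. */
    static List<String> reduce(Couple key, Iterable<Long> values) {
        List<String> seen = new ArrayList<>();
        for (Long value : values) {
            seen.add(key + " -> " + value);
        }
        return seen;
    }

    static List<String> run() {
        // Records {key1, key2, value}, sorted by (key1, key2) and grouped on key1 == 7
        long[][] records = { {7, 10, 10}, {7, 20, 20}, {7, 30, 30} };
        Couple key = new Couple();
        return reduce(key, valuesOfGroup(records, key));
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this model `reduce` receives a single `key` argument, yet `key2` reads 10, 20 and 30 across the three iterations, which matches what the reducer in the question observes.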

You can verify this by printing the key inside the loop: you will notice that its contents change on every iteration. With the default Writable serialization, Hadoop reuses a single key instance and deserializes each record's key into it as the value iterator advances, so it is the contents of the key object that change, not its object reference.
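One practical consequence, if the key object is reused between iterations (which is how Hadoop's default Writable deserialization behaves, as far as I know): storing the key itself in a collection keeps only its last state, so you need to copy it when you want to hold on to earlier states. The sketch below shows this in plain Java under that assumption. `KeyCopyDemo`, `Couple` and its copy constructor are hypothetical names, not part of the code in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of keeping a reused mutable key versus copying it. */
class KeyCopyDemo {

    static final class Couple {
        long key1;
        long key2;

        Couple() {}

        /** Copy constructor: snapshots the current state of another key. */
        Couple(Couple other) { key1 = other.key1; key2 = other.key2; }

        void set(long k1, long k2) { key1 = k1; key2 = k2; }

        @Override
        public String toString() { return key1 + "," + key2; }
    }

    /** Collects the key seen on each iteration, either by reference or by copy. */
    static List<String> collect(boolean copy) {
        long[][] records = { {7, 10}, {7, 20}, {7, 30} };
        Couple reused = new Couple();          // one instance, overwritten per record
        List<Couple> kept = new ArrayList<>();
        for (long[] r : records) {
            reused.set(r[0], r[1]);
            kept.add(copy ? new Couple(reused) : reused);
        }
        List<String> out = new ArrayList<>();
        for (Couple c : kept) {
            out.add(c.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("by reference: " + collect(false));
        System.out.println("by copy:      " + collect(true));
    }
}
```

Collected by reference, all three entries show the last key (7,30). Collected by copy, they show 7,10, then 7,20, then 7,30.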

Alternatively, you can try applying a grouping comparator to an IntWritable key so that several distinct keys fall into one group, in the following way (you will have to write your own logic to do so):

Group1:
1        a
1        b
2        c

Group2:
3        c
3        d
4        a

You will see that with every iteration over the values, your key also changes.
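The grouping shown above can be sketched in plain Java as follows. This only models how a grouping comparator splits key-sorted records into reduce calls. It does not use the Hadoop API, and the rule "keys 1 and 2 form one group, 3 and 4 the next" is an assumed example of the custom logic mentioned above. `GroupingDemo` and `GROUPING` are made-up names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Models how a grouping comparator splits sorted records into groups. */
class GroupingDemo {

    /** Hypothetical grouping rule: keys 1-2 compare equal, keys 3-4 compare equal, and so on. */
    static final Comparator<Integer> GROUPING = Comparator.comparingInt(k -> (k - 1) / 2);

    /** Splits key-sorted records into runs that the grouping comparator treats as equal. */
    static List<List<String>> group(int[] keys, String[] values) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            boolean startsNewGroup = i == 0 || GROUPING.compare(keys[i - 1], keys[i]) != 0;
            if (startsNewGroup) {
                groups.add(new ArrayList<>());
            }
            // Each value keeps its own key, even though the group spans several keys
            groups.get(groups.size() - 1).add(keys[i] + " " + values[i]);
        }
        return groups;
    }

    static List<List<String>> run() {
        int[] keys = {1, 1, 2, 3, 3, 4};
        String[] values = {"a", "b", "c", "c", "d", "a"};
        return group(keys, values);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Within the first group the key goes from 1 to 2 while the values are being iterated, which is the same effect as `key2` changing in the composite-key case.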
