
Hadoop: Secondary sort does not work

I have implemented an algorithm in Hadoop 1.2.1 where the reducer code relies on secondary sorting. However, when I run the algorithm, one reducer receives sorted tuples but the other does not. I've spent a lot of time trying to figure out why, but without any success.

Does anyone know what might be the problem? I assume it has to do with the secondary sort code.

Here is the code that implements the secondary sorting:

Composite key

    public class CompositeKey implements WritableComparable<CompositeKey>{
        public String key;
        public Integer position;
        @Override
        public void readFields(DataInput arg0) throws IOException {
            key = WritableUtils.readString(arg0);
            position = arg0.readInt();
        }
        @Override
        public void write(DataOutput arg0) throws IOException {
            WritableUtils.writeString(arg0, key);
            arg0.writeLong(position);
        }
        @Override
        public int compareTo(CompositeKey o) {
            int result = key.compareTo(o.key);
            if(0 == result) {
                result = position.compareTo(o.position);
            }
            return result;
        }
    }

Key comparator

    public class CompositeKeyComparator extends WritableComparator {
        protected CompositeKeyComparator() {
            super(CompositeKey.class, true);
        }
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey k1 = (CompositeKey) w1;
            CompositeKey k2 = (CompositeKey) w2;

            int result = k1.key.compareTo(k2.key);
            if (0 == result) {
                result = -1 * k1.position.compareTo(k2.position);
            }
            return result;
        }
    }

Grouping comparator

    public class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(CompositeKey.class, true);
        }
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey k1 = (CompositeKey) w1;
            CompositeKey k2 = (CompositeKey) w2;

            return k1.key.compareTo(k2.key);
        }
    }

Partitioner

    public class NaturalKeyPartitioner extends Partitioner<CompositeKey, ReduceValue> {
        @Override
        public int getPartition(CompositeKey key, ReduceValue val, int numPartitions) {
            int hash = key.key.hashCode();
            int partition = hash & Integer.MAX_VALUE % numPartitions;
            return partition;
        }
    }

Job configuration

    //secondary sort
    job.setPartitionerClass(NaturalKeyPartitioner.class);
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
    job.setSortComparatorClass(CompositeKeyComparator.class);

If I execute this either in the pseudo-distributed environment or on the cluster, I notice that one reducer gets sorted tuples while the other does not. For example, here is an excerpt showing the tuples received by two reducers (the first column is the primary key, and the second is the secondary key):

    First reducer:
    a1 0 
    a1 1 
    a1 11 
    a1 16 
    a1 27 
    a1 28 
    a1 34 
    a1 35 
    a1 37 
    a1 38 
    a1 43 
    a1 44 
    a1 46 
    a1 48 
    a1 50 
    a1 54 
    a1 55 
    a1 56 
    a1 57 
    a1 60 
    a1 61 
    a1 63 
    a1 64 
    a1 66 
    a1 69 
    a1 70 
    a1 72 
    a1 75 
    a1 76 
    a1 78 
    a1 79 
    a1 80 
    a1 84 
    a1 85 
    a1 86 
    a1 87 
    a1 88 
    a1 91 
    a1 92 
    a1 97 
    a1 102   
    a1 106    
    a1 108  
    a1 109 
    a1 110 
    a1 111 
    a1 116     
    a1 118  
    a1 119 
    a1 120  

    Second reducer:
    a2 87 
    a2 115
    a2 65 
    a2 90 
    a2 68 
    a2 119    
    a2 91 
    a2 0 
    a2 70 
    a2 3 
    a2 8 
    a2 9 
    a2 10 
    a2 71 
    a2 110   
    a2 16 
    a2 17 
    a2 20 
    a2 21 
    a2 23 
    a2 26 
    a2 72 
    a2 27 
    a2 94 
    a2 29 
    a2 30 
    a2 31 
    a2 75 
    a2 95 
    a2 36 
    a2 76 
    a2 117  
    a2 39 
    a2 40 
    a2 41 
    a2 42 
    a2 97 
    a2 79 
    a2 44 
    a2 45 
    a2 98 
    a2 46 
    a2 80 
    a2 49 
    a2 82 
    a2 50 
    a2 83 
    a2 100 
    a2 84 
    a2 112     
    a2 57 
    a2 59 
    a2 113      
    a2 60 
    a2 114       
    a2 61 

I think it's because in your serialize/deserialize logic for CompositeKey you write the position as a long but read it back as an int. That will mess up the comparison logic, because you're not comparing exactly the same thing you wrote to the context.
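
A minimal sketch of one way to make the two methods symmetric, replacing readFields/write in the CompositeKey class above (assuming position stays an Integer, as in the question):

    @Override
    public void readFields(DataInput arg0) throws IOException {
        key = WritableUtils.readString(arg0);
        position = arg0.readInt();      // read an int...
    }

    @Override
    public void write(DataOutput arg0) throws IOException {
        WritableUtils.writeString(arg0, key);
        arg0.writeInt(position);        // ...and write an int, not a long
    }

Alternatively, keep writeLong on the write side and use readLong on the read side; what matters is that both sides agree on the serialized form.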
