
Creating composite key class for Secondary Sort

I am trying to create a composite key class made of a String uniqueCarrier and an int month for Secondary Sort. Can anyone tell me what the steps are?

It looks like you have an equality problem, since you're not using uniqueCarrier in your compareTo method. You need to use uniqueCarrier in both your compareTo and equals methods (and you need to define an equals method as well). From the java.lang.Comparable documentation:

The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
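
To make the consequence concrete, here is a minimal plain-Java sketch. The class names (ConsistencyDemo, MonthOnlyKey) are hypothetical and only mimic a key whose compareTo looks at month alone; this is not the asker's actual code. Any sorted collection that relies on compareTo will then treat two different carriers in the same month as the same key:

```java
import java.util.TreeSet;

final class ConsistencyDemo {
    // Hypothetical key: compareTo uses month only and ignores uniqueCarrier.
    static final class MonthOnlyKey implements Comparable<MonthOnlyKey> {
        final String uniqueCarrier;
        final int month;

        MonthOnlyKey(String uniqueCarrier, int month) {
            this.uniqueCarrier = uniqueCarrier;
            this.month = month;
        }

        @Override
        public int compareTo(MonthOnlyKey other) {
            return Integer.compare(month, other.month);
        }
    }

    // Adds two keys with different carriers but the same month to a TreeSet.
    static int distinctKeys() {
        TreeSet<MonthOnlyKey> keys = new TreeSet<>();
        keys.add(new MonthOnlyKey("AA", 1));
        // compareTo returns 0 against the first key, so this add is rejected.
        keys.add(new MonthOnlyKey("DL", 1));
        return keys.size();
    }
}
```

Here ConsistencyDemo.distinctKeys() returns 1 rather than 2, because the TreeSet decides equality purely through compareTo.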

You can also implement a RawComparator so that keys can be compared without deserializing them, which is faster.
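
As a rough illustration of what comparing without deserializing means, the plain-Java sketch below compares two serialized keys byte by byte. It rests on several assumptions that are not in the question: the key is written with DataOutput.writeUTF(uniqueCarrier) followed by writeInt(month), carrier codes are plain ASCII (so byte order matches String.compareTo), and each key starts at offset 0 of its array. The class and method names are hypothetical. In a real job the same logic would go into the compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) method of a RawComparator, with the start offsets added to every index.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

final class RawCompositeKeyComparison {
    // Assumed wire format: 2-byte length + carrier bytes (writeUTF), then a 4-byte month.
    static byte[] serialize(String uniqueCarrier, int month) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(uniqueCarrier);
            out.writeInt(month);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Orders by carrier bytes first, then by month, without building any key object.
    static int compareRaw(byte[] a, byte[] b) {
        ByteBuffer left = ByteBuffer.wrap(a);   // big-endian, like DataOutputStream
        ByteBuffer right = ByteBuffer.wrap(b);
        int leftLen = left.getShort(0) & 0xFFFF;   // unsigned length prefix
        int rightLen = right.getShort(0) & 0xFFFF;

        // Arrays.compareUnsigned requires Java 9 or later.
        int byCarrier = Arrays.compareUnsigned(a, 2, 2 + leftLen, b, 2, 2 + rightLen);
        if (byCarrier != 0) {
            return byCarrier;
        }
        return Integer.compare(left.getInt(2 + leftLen), right.getInt(2 + rightLen));
    }
}
```

If the key stores its fields as Text and IntWritable instead, the byte layout differs and the offsets above no longer apply.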

However, I recommend (as I always do) not writing things like Secondary Sort yourself. These have already been implemented, along with dozens of other optimizations, in projects like Pig and Hive. E.g., if you were using Hive, all you would need to write is:

SELECT ...
FROM my_table
ORDER BY month, carrier;

The above is a lot simpler than figuring out how to write a Secondary Sort (and, when you eventually need it again, how to write it in a generic fashion). MapReduce should be considered a low-level programming paradigm and should only be used (IMHO) when you need high-performance optimizations that you don't get from higher-level projects like Pig or Hive.

EDIT: I forgot to mention grouping comparators; see Matt's answer.

Your compareTo() implementation is incorrect. You need to sort first on uniqueCarrier, then on month to break ties:

@Override
public int compareTo(CompositeKey other) {
    if (this.getUniqueCarrier().equals(other.getUniqueCarrier())) {
        return this.getMonth().compareTo(other.getMonth());
    } else {
        return this.getUniqueCarrier().compareTo(other.getUniqueCarrier());
    }
}

One suggestion though: I typically choose to implement my attributes directly as Writable types if possible (for example, IntWritable month and Text uniqueCarrier). This allows me to call write and readFields directly on them, and also to use their compareTo. Less code to write is always good...

Speaking of less code, you don't have to call the parent constructor for your composite key.

Now for what is left to be done:

My guess is that you are still missing a hashCode() method, which should return only the hash of the attribute you want to group on, in this case uniqueCarrier. This method is called by the default Hadoop partitioner to distribute work across reducers.
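
Putting the pieces so far together, a key class could look roughly like the sketch below. It is written against plain java.io so that it compiles without Hadoop on the classpath; in a real job the class would be declared as implementing WritableComparable<CompositeKey>, and the fields could be Text and IntWritable as suggested above. The partitionFor helper is hypothetical and only mirrors the formula used by Hadoop's default HashPartitioner, to show why hashing on uniqueCarrier alone keeps all months of a carrier on the same reducer.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch only: in a Hadoop job, declare "implements WritableComparable<CompositeKey>".
final class CompositeKey implements Comparable<CompositeKey> {
    private String uniqueCarrier = "";
    private int month;

    // No-arg constructor: Hadoop instantiates keys reflectively, then calls readFields.
    CompositeKey() { }

    CompositeKey(String uniqueCarrier, int month) {
        this.uniqueCarrier = uniqueCarrier;
        this.month = month;
    }

    String getUniqueCarrier() { return uniqueCarrier; }
    int getMonth() { return month; }

    // Same signatures as Writable.write and Writable.readFields.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(uniqueCarrier);
        out.writeInt(month);
    }

    public void readFields(DataInput in) throws IOException {
        uniqueCarrier = in.readUTF();
        month = in.readInt();
    }

    // Sort on carrier first, then on month to break ties.
    @Override
    public int compareTo(CompositeKey other) {
        int byCarrier = uniqueCarrier.compareTo(other.uniqueCarrier);
        return byCarrier != 0 ? byCarrier : Integer.compare(month, other.month);
    }

    // Consistent with compareTo: equal only when both fields match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey other = (CompositeKey) o;
        return month == other.month && uniqueCarrier.equals(other.uniqueCarrier);
    }

    // Hash on the grouping attribute only, so partitioning ignores month.
    @Override
    public int hashCode() {
        return uniqueCarrier.hashCode();
    }

    // Hypothetical helper mirroring the default HashPartitioner formula.
    static int partitionFor(CompositeKey key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With this hashCode, new CompositeKey("AA", 1) and new CompositeKey("AA", 12) always map to the same partition, whatever the number of reducers.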

I would also write a custom GroupingComparator and SortingComparator to make sure that grouping happens only on uniqueCarrier, and that sorting follows CompositeKey's compareTo():

public class CompositeGroupingComparator extends WritableComparator {
    public CompositeGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;

        return first.getUniqueCarrier().compareTo(second.getUniqueCarrier());
    }
}


public class CompositeSortingComparator extends WritableComparator {
    public CompositeSortingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CompositeKey first = (CompositeKey) a;
        CompositeKey second = (CompositeKey) b;

        return first.compareTo(second);
    }
}

Then, tell your Driver to use those two:

job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);
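
To see what the two comparators buy you together, here is a small plain-Java simulation of the idea. It is not Hadoop's implementation, and every name in it (SecondarySortSimulation, Key, shuffle) is hypothetical. The keys are sorted with the full ordering (carrier, then month), and a new group starts only when the grouping ordering (carrier alone) sees a change, so each simulated reduce call receives one carrier with its months already in ascending order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSimulation {
    static final class Key {
        final String uniqueCarrier;
        final int month;

        Key(String uniqueCarrier, int month) {
            this.uniqueCarrier = uniqueCarrier;
            this.month = month;
        }
    }

    // Plays the role of the sort comparator: carrier first, then month.
    static final Comparator<Key> SORT =
            Comparator.comparing((Key k) -> k.uniqueCarrier).thenComparingInt(k -> k.month);

    // Plays the role of the grouping comparator: carrier only.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.uniqueCarrier);

    // Returns, per simulated reduce call, the carrier and the months it sees in order.
    static Map<String, List<Integer>> shuffle(List<Key> mapOutput) {
        List<Key> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);

        Map<String, List<Integer>> reduceCalls = new LinkedHashMap<>();
        Key previous = null;
        List<Integer> current = null;
        for (Key key : sorted) {
            // A new group begins whenever the grouping comparator reports a difference.
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                reduceCalls.put(key.uniqueCarrier, current);
            }
            current.add(key.month);
            previous = key;
        }
        return reduceCalls;
    }
}
```

For the input (DL, 3), (AA, 7), (DL, 1), (AA, 2), the simulation yields two groups: AA with months [2, 7], then DL with months [1, 3].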

Edit: If you want to optimize further, also see Pradeep's suggestion of implementing a RawComparator, which avoids having to unmarshal each key into an object for every comparison.
