How to implement a group comparator for Hadoop?

Given a class called KeyLabelDistance, which I am passing as the key and value in Hadoop, I want to perform a secondary sort on it, i.e. I first want to sort by increasing key and then by DECREASING distance.

In order to do this, I need to write my own GroupingComparator. My question is: since the setGroupingComparator() method accepts only a class that extends RawComparator, how do I perform this comparison in the grouping comparator in terms of bytes? Do I need to explicitly serialize and deserialize the objects? Also, does having KeyLabelDistance implement WritableComparable as follows make a SortComparator redundant?

I learned about the roles of SortComparator and GroupComparator from this answer: What are the differences between Sort Comparator and Group Comparator in Hadoop?

Following is the implementation of KeyLabelDistance:

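import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class KeyLabelDistance implements WritableComparable<KeyLabelDistance>
    {
        private int key;
        private int label;
        private double distance;

        // Hadoop creates key instances reflectively, so a no-argument constructor is needed.
        public KeyLabelDistance()
        {
            key = 0;
            label = 0;
            distance = 0;
        }
        public KeyLabelDistance(int key, int label, double distance)
        {
            this.key = key;
            this.label = label;
            this.distance = distance;
        }
        public int getKey() {
            return key;
        }
        public void setKey(int key) {
            this.key = key;
        }
        public int getLabel() {
            return label;
        }
        public void setLabel(int label) {
            this.label = label;
        }
        public double getDistance() {
            return distance;
        }
        public void setDistance(double distance) {
            this.distance = distance;
        }

        // Serialize the fields; readFields must read them back in the same order.
        @Override
        public void write(DataOutput out) throws IOException
        {
            out.writeInt(key);
            out.writeInt(label);
            out.writeDouble(distance);
        }

        @Override
        public void readFields(DataInput in) throws IOException
        {
            key = in.readInt();
            label = in.readInt();
            distance = in.readDouble();
        }

        // Comparable requires a one-argument compareTo; a two-argument version does not implement the interface.
        @Override
        public int compareTo(KeyLabelDistance other)
        {
            if(this == other)
                return 0;
            int byKey = Integer.compare(key, other.key);
            if(byKey != 0)
                return byKey;
            //If the keys are equal, look at the distances -> a larger "distance" means more "similarity", so the order is reversed (descending)
            return Double.compare(other.distance, distance);
        }
    }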

The code for the group comparator is as follows:

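import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class KeyLabelDistanceGroupingComparator extends WritableComparator{
    // Passing true asks WritableComparator to create key instances, so its
    // byte-level compare can deserialize both keys and call compare() below.
    public KeyLabelDistanceGroupingComparator()
    {
        super(KeyLabelDistance.class, true);
    }

    // Must take WritableComparable arguments to override the superclass method.
    @Override
    public int compare (WritableComparable a, WritableComparable b)
    {
        KeyLabelDistance lhs = (KeyLabelDistance) a;
        KeyLabelDistance rhs = (KeyLabelDistance) b;
        // Group on the key only; the distance is deliberately ignored.
        return Integer.compare(lhs.getKey(), rhs.getKey());
    }
}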

Any help is appreciated. Thanks in advance.

You can extend WritableComparator, which in turn implements RawComparator. Both your sort comparator and your grouping comparator can extend WritableComparator.
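
On the "in terms of bytes" part of the question: when a WritableComparator subclass is constructed with createInstances set to true, its byte-level compare deserializes both keys and delegates to compare(WritableComparable, WritableComparable), so you do not have to touch the bytes yourself. If you do want to compare the raw bytes directly, the idea is to read only the leading int of each serialized key. The following is a minimal sketch in plain Java with no Hadoop dependency. The class RawKeyCompareSketch and its methods are hypothetical names for illustration, and it assumes the key field is serialized first as a 4-byte big-endian int, which is the layout DataOutput.writeInt produces.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Hadoop: compares two serialized keys by
// their leading 4-byte int only, which is all a grouping comparison needs.
class RawKeyCompareSketch {

    // Lay out the fields in the assumed order: key, label, distance.
    // ByteBuffer is big-endian by default, matching DataOutput's layout.
    static byte[] serialize(int key, int label, double distance) {
        return ByteBuffer.allocate(4 + 4 + 8)
                .putInt(key)
                .putInt(label)
                .putDouble(distance)
                .array();
    }

    // Same parameter shape as RawComparator.compare: buffer, start offset, length.
    static int compareKeys(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int k1 = ByteBuffer.wrap(b1, s1, l1).getInt(); // reads 4 bytes at offset s1
        int k2 = ByteBuffer.wrap(b2, s2, l2).getInt(); // reads 4 bytes at offset s2
        return Integer.compare(k1, k2);
    }
}
```

Two records with the same key but different labels and distances compare as equal here, which is the behavior a grouping comparator wants. In a real job the same logic would live in an override of the six-argument compare method of a RawComparator.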

If you do not provide these comparators, Hadoop will internally end up using the compareTo method of the Writable that is your key.
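
To see why a secondary sort still wants a separate grouping comparator even when compareTo already gives the full order, here is a rough simulation in plain Java of the two roles: one comparator orders the records, and a second, coarser one decides where each reduce group starts. This is only an illustrative sketch, not how Hadoop is implemented internally; SecondarySortSketch, Rec, SORT and GROUP are hypothetical names, and Rec stands in for KeyLabelDistance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {

    // Hypothetical stand-in for the composite key from the question.
    static final class Rec {
        final int key;
        final int label;
        final double distance;

        Rec(int key, int label, double distance) {
            this.key = key;
            this.label = label;
            this.distance = distance;
        }
    }

    // Sort order: key ascending, then distance descending.
    static final Comparator<Rec> SORT = (a, b) -> {
        int byKey = Integer.compare(a.key, b.key);
        return byKey != 0 ? byKey : Double.compare(b.distance, a.distance);
    };

    // Grouping order: key only, so all distances for one key share a group.
    static final Comparator<Rec> GROUP = (a, b) -> Integer.compare(a.key, b.key);

    // Sort with SORT, then open a new group whenever GROUP says the current
    // record differs from the previous one.
    static List<List<Rec>> sortAndGroup(List<Rec> input) {
        List<Rec> sorted = new ArrayList<>(input);
        sorted.sort(SORT);
        List<List<Rec>> groups = new ArrayList<>();
        Rec previous = null;
        for (Rec current : sorted) {
            if (previous == null || GROUP.compare(previous, current) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(current);
            previous = current;
        }
        return groups;
    }
}
```

If GROUP were replaced by SORT in the loop, every distinct (key, distance) pair would open its own group, which is what happens when only the full compareTo is used for grouping.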
