
Spark - Word Count with Sorting (is not sorting)

I am learning Spark and trying to extend the WordCount example by sorting the words by their number of occurrences. The problem is that after running the code, the results come out unsorted:

(708,word1)
(46,word2)
(65,word3)

So it seems the sorting failed for some reason. The effect is the same with the wordSortedByCount.first() command and when execution is limited to a single thread.

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class JavaWordCount2 {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountAndSort");
        int numOfKernels = 8;
        sparkConf.setMaster("local[" + numOfKernels + "]");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = ctx.textFile("data.csv", 1);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line
                .split("[,; :\\.]")));
        words = words.flatMap(line -> Arrays.asList(line.replaceAll("[\"\\(\\)]", "").toLowerCase()));

        // sum words
        JavaPairRDD<String, Integer> counts = words.mapToPair(
                w -> new Tuple2<String, Integer>(w, 1)).reduceByKey(
                (x, y) -> x + y);

        // keep only words with more than 5 occurrences
        // counts = counts.filter(s -> s._2 > 5);
        counts = counts.filter(new Function<Tuple2<String,Integer>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, Integer> v1) throws Exception {
                return v1._2 > 5;
            }
        });

        // to enable sorting by value (count) and not key -> value-to-key conversion pattern
        // setting value to null, since it won't be used anymore
        JavaPairRDD<Tuple2<Integer, String>, Integer> countInKey = counts.mapToPair(a -> new Tuple2(new Tuple2<Integer, String>(a._2, a._1), null));

        // sort by number of occurrences
        JavaPairRDD<Tuple2<Integer, String>, Integer> wordSortedByCount = countInKey.sortByKey(new TupleComparator(), true);

        // print result    
        List<Tuple2<Tuple2<Integer, String>, Integer>> output = wordSortedByCount.take(10);
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1());
        }
        ctx.stop();
    }
}

Class for comparison:

import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;
public class TupleComparator implements Comparator<Tuple2<Integer, String>>,
        Serializable {
    @Override
    public int compare(Tuple2<Integer, String> tuple1,
            Tuple2<Integer, String> tuple2) {
        return tuple1._1 < tuple2._1 ? 0 : 1;
    }
}

Could anyone point out what is wrong with the code?

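The first problem with your code is in the comparator. It returns 0 or 1, whereas the compare method must return a negative value when the first element comes before the second one, zero when the two are equal, and a positive value otherwise. Because your version never returns a negative value, the sort cannot tell which element is smaller. So change it to: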

@Override
public int compare(Tuple2<Integer, String> tuple1,
        Tuple2<Integer, String> tuple2) {
    return tuple1._1 - tuple2._1;
}
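To see why the original comparator cannot work, here is a minimal sketch in plain Java, without Spark. It uses java.util.Map.Entry as a stand-in for scala.Tuple2; the class name ComparatorContractDemo and the helpers pair and sortedCounts are made up for this illustration. It also uses Integer.compare rather than subtraction, which is only a precaution against overflow and makes no difference for non-negative counts.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ComparatorContractDemo {
    // Stand-in for scala.Tuple2<Integer, String>: key = count, value = word.
    public static Map.Entry<Integer, String> pair(int count, String word) {
        return new SimpleEntry<>(count, word);
    }

    // The comparator from the question: it never returns a negative value.
    public static final Comparator<Map.Entry<Integer, String>> BROKEN =
            (a, b) -> a.getKey() < b.getKey() ? 0 : 1;

    // A version that respects the contract: negative, zero, or positive.
    public static final Comparator<Map.Entry<Integer, String>> FIXED =
            (a, b) -> Integer.compare(a.getKey(), b.getKey());

    // Sorts the three sample pairs from the question with the fixed comparator
    // and returns only the counts, in the resulting order.
    public static List<Integer> sortedCounts() {
        List<Map.Entry<Integer, String>> data = new ArrayList<>(Arrays.asList(
                pair(708, "word1"), pair(46, "word2"), pair(65, "word3")));
        data.sort(FIXED);
        List<Integer> counts = new ArrayList<>();
        for (Map.Entry<Integer, String> e : data) {
            counts.add(e.getKey());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map.Entry<Integer, String> small = pair(46, "word2");
        Map.Entry<Integer, String> large = pair(708, "word1");

        // Broken: "small vs large" claims equality, "large vs small" says greater.
        System.out.println(BROKEN.compare(small, large)); // 0
        System.out.println(BROKEN.compare(large, small)); // 1

        // Fixed: the sign flips when the arguments are swapped.
        System.out.println(FIXED.compare(small, large) < 0); // true
        System.out.println(FIXED.compare(large, small) > 0); // true

        System.out.println(sortedCounts()); // [46, 65, 708]
    }
}
```

The broken comparator reports a smaller count as equal to a larger one, so no sorting algorithm can rely on it to place elements in order.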

Moreover, you should set the second parameter of sortByKey to false; otherwise you will get ascending order, i.e. from the lowest count to the highest, which I think is the opposite of what you want.
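The effect of the ascending flag can be illustrated in plain Java, where the rough equivalent is reversing the comparator. This is only a sketch of the ordering, not Spark code; the class name DescendingCountDemo and the method sortDescending are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DescendingCountDemo {
    // Returns a copy of the counts ordered from largest to smallest.
    public static List<Integer> sortDescending(List<Integer> counts) {
        Comparator<Integer> ascending = Integer::compare;
        List<Integer> copy = new ArrayList<>(counts);
        // Reversing an ascending comparator yields descending order,
        // comparable to passing false as the second argument of sortByKey.
        copy.sort(ascending.reversed());
        return copy;
    }

    public static void main(String[] args) {
        // Counts taken from the sample output in the question.
        System.out.println(sortDescending(Arrays.asList(708, 46, 65))); // [708, 65, 46]
    }
}
```

With descending order, take(10) in the question's code would return the ten most frequent words rather than the ten least frequent ones.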
