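Spark - Word Count with Sorting (is not sorting)
I am learning Spark and trying to extend the WordCount example so that the words are sorted by their number of occurrences. The problem is that after running the code, the results I get are not sorted: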
(708,word1)
(46,word2)
(65,word3)
So for some reason the sort seems to fail. I see the same effect when I call wordSortedByCount.first(), and also when I restrict execution to a single thread.
import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
public class JavaWordCount2 {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountAndSort");
        int numOfKernels = 8;
        sparkConf.setMaster("local[" + numOfKernels + "]");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile("data.csv", 1);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line
                .split("[,; :\\.]")));
        words = words.flatMap(line -> Arrays.asList(line.replaceAll("[\"\\(\\)]", "").toLowerCase()));
        // sum words
        JavaPairRDD<String, Integer> counts = words.mapToPair(
                w -> new Tuple2<String, Integer>(w, 1)).reduceByKey(
                (x, y) -> x + y);
        // keep only words with more than 5 occurrences
        // counts = counts.filter(s -> s._2 > 5);
        counts = counts.filter(new Function<Tuple2<String,Integer>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, Integer> v1) throws Exception {
                return v1._2 > 5;
            }
        });
        // to enable sorting by value (count) and not key -> value-to-key conversion pattern
        // setting value to null, since it won't be used anymore
        JavaPairRDD<Tuple2<Integer, String>, Integer> countInKey = counts.mapToPair(a -> new Tuple2(new Tuple2<Integer, String>(a._2, a._1), null));
        // sort by number of occurrences
        JavaPairRDD<Tuple2<Integer, String>, Integer> wordSortedByCount = countInKey.sortByKey(new TupleComparator(), true);
        // print result
        List<Tuple2<Tuple2<Integer, String>, Integer>> output = wordSortedByCount.take(10);
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1());
        }
        ctx.stop();
    }
}
The comparator class:
import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;
public class TupleComparator implements Comparator<Tuple2<Integer, String>>,
        Serializable {
    @Override
    public int compare(Tuple2<Integer, String> tuple1,
            Tuple2<Integer, String> tuple2) {
        return tuple1._1 < tuple2._1 ? 0 : 1;
    }
}
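Can anyone point out what is wrong with my code?
The first problem with your code is in the comparator. You return 0 or 1, whereas the compare method should return a negative value when the first element comes before the second, zero when the two are equal, and a positive value when the first comes after the second. So change it to: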
@Override
public int compare(Tuple2<Integer, String> tuple1,
        Tuple2<Integer, String> tuple2) {
    return tuple1._1 - tuple2._1;
}
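The comparator contract can be checked without a Spark cluster. The following is only a minimal sketch under a few assumptions: the class name CountComparatorDemo and the helper sortedByCount are hypothetical, and java.util.AbstractMap.SimpleEntry stands in for scala.Tuple2 (key = count, value = word). It uses Integer.compare instead of subtraction; for non-negative counts both give the same ordering, but Integer.compare cannot overflow for arbitrary int values. A comparator passed to sortByKey would additionally need to be Serializable, as TupleComparator in the question already is.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CountComparatorDemo {
    // Hypothetical stand-in for Tuple2<Integer, String>: key = count, value = word.
    // Returns negative / zero / positive, as the Comparator contract requires.
    static final Comparator<SimpleEntry<Integer, String>> BY_COUNT =
            (a, b) -> Integer.compare(a.getKey(), b.getKey());

    // Returns a new list sorted by ascending count; the input is left untouched.
    static List<SimpleEntry<Integer, String>> sortedByCount(
            List<SimpleEntry<Integer, String>> input) {
        List<SimpleEntry<Integer, String>> copy = new ArrayList<>(input);
        copy.sort(BY_COUNT);
        return copy;
    }

    public static void main(String[] args) {
        // The three sample pairs from the question's output.
        List<SimpleEntry<Integer, String>> counts = Arrays.asList(
                new SimpleEntry<Integer, String>(708, "word1"),
                new SimpleEntry<Integer, String>(46, "word2"),
                new SimpleEntry<Integer, String>(65, "word3"));
        for (SimpleEntry<Integer, String> e : sortedByCount(counts)) {
            System.out.println("(" + e.getKey() + "," + e.getValue() + ")");
        }
    }
}
```

With the sample pairs from the question, this prints the entries in ascending order of count: (46,word2), (65,word3), (708,word1).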
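In addition, you should set the second argument of sortByKey to false; otherwise you get ascending order, i.e. from lowest to highest, which is exactly the opposite of what you want.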
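The effect of the ascending flag can be illustrated with plain Java collections. This is a sketch only, not the Spark call itself: the class DescendingCountsDemo and the method topWords are hypothetical names, and SimpleEntry again stands in for Tuple2<Integer, String>. Reversing a correct comparator and taking the first n elements mirrors what sortByKey with a false second argument followed by take(n) is meant to produce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class DescendingCountsDemo {
    // Returns the n most frequent words, highest count first.
    static List<String> topWords(List<SimpleEntry<Integer, String>> counts, int n) {
        Comparator<SimpleEntry<Integer, String>> byCount =
                (a, b) -> Integer.compare(a.getKey(), b.getKey());
        return counts.stream()
                .sorted(byCount.reversed()) // descending, like ascending = false
                .limit(n)
                .map(e -> e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SimpleEntry<Integer, String>> counts = Arrays.asList(
                new SimpleEntry<Integer, String>(708, "word1"),
                new SimpleEntry<Integer, String>(46, "word2"),
                new SimpleEntry<Integer, String>(65, "word3"));
        System.out.println(topWords(counts, 2));
    }
}
```

For the sample data this prints [word1, word3], the two words with the highest counts.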