
How to sort an RDD in Scala Spark?

Reading the Spark method sortByKey:

sortByKey([ascending], [numTasks])   When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

Is it possible to return just "N" results? So instead of returning all results, just return the top 10. I could convert the sorted collection to an Array and use the take method, but since this is an O(N) operation, is there a more efficient approach?

If you only need the top 10, use rdd.top(10) . It avoids a full sort, so it is faster.
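For example, a minimal sketch, assuming a hypothetical pair RDD `counts` of (word, count) pairs and an existing SparkContext `sc` (as in spark-shell):

  import org.apache.spark.rdd.RDD

  // hypothetical data; the RDD and field names are illustrative only
  val counts: RDD[(String, Int)] = sc.parallelize(Seq(("a", 3), ("b", 7), ("c", 1)))

  // largest 10 pairs by value, via an explicit Ordering on the second field
  val top10ByValue: Array[(String, Int)] =
    counts.top(10)(Ordering.by[(String, Int), Int](_._2))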

rdd.top makes one parallel pass through the data, collecting the top N from each partition in a heap, then merges the heaps. It is an O(rdd.count) operation. Sorting would be O(rdd.count log rdd.count) and incur far more data transfer: it does a shuffle, so all of the data would be transmitted over the network.
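The same idea can be sketched by hand with mapPartitions: keep only the N largest items in each partition, then merge the small candidate sets on the driver. (Spark's actual top uses a bounded priority queue; a sort-and-take is used here only to keep the sketch short, and `counts` is the hypothetical pair RDD from above.)

  val n = 10
  val topN = counts
    .mapPartitions(iter => iter.toSeq.sortBy(kv => -kv._2).take(n).iterator) // local top N per partition
    .collect()                                                               // at most n * numPartitions items reach the driver
    .sortBy(kv => -kv._2)
    .take(n)                                                                 // final merge on the driver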

Most likely you have already perused the source code:

  class OrderedRDDFunctions {
    // <snip>
    def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P] = {
      val part = new RangePartitioner(numPartitions, self, ascending)
      val shuffled = new ShuffledRDD[K, V, P](self, part)
      shuffled.mapPartitions(iter => {
        val buf = iter.toArray
        if (ascending) {
          buf.sortWith((x, y) => x._1 < y._1).iterator
        } else {
          buf.sortWith((x, y) => x._1 > y._1).iterator
        }
      }, preservesPartitioning = true)
    }
  }

And, as you say, the entire dataset must go through the shuffle stage, as seen in the snippet.

However, your concern about subsequently invoking take(K) may not be warranted. That operation does NOT cycle through all N items:

  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   */
  def take(num: Int): Array[T] = {

So then, it would seem:

O(myRdd.take(K)) << O(myRdd.sortByKey()) ~= O(myRdd.sortByKey().take(k)) (at least for small K) << O(myRdd.sortByKey().collect())
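As a sketch of the full-sort route under the same assumptions (the hypothetical `counts` pair RDD from above): the whole dataset is shuffled by sortByKey, but take only pulls as many partitions of the sorted result as it needs.

  // on older Spark versions the pair-RDD implicits may require:
  // import org.apache.spark.SparkContext._

  // full sort by key (shuffles everything), then keep only the first 10 rows
  val first10BySortedKey: Array[(String, Int)] =
    counts.sortByKey(ascending = false).take(10)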

Another option, available at least since PySpark 1.2.0, is to use takeOrdered .

In ascending order:

rdd.takeOrdered(10)

In descending order:

rdd.takeOrdered(10, lambda x: -x)

Top k values for (k, v) pairs:

rdd.takeOrdered(10, lambda kv: -kv[1])
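The Scala RDD API has takeOrdered as well; there the sort order comes from an implicit Ordering parameter rather than a key function. A sketch, again assuming the hypothetical `counts` pair RDD from above:

  // ascending by the natural tuple ordering (key first, then value)
  val smallest10 = counts.takeOrdered(10)

  // top 10 by value: reverse an Ordering built on the second field
  val largest10ByValue = counts.takeOrdered(10)(Ordering.by[(String, Int), Int](_._2).reverse)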
