How to apply partial sort on a Spark DataFrame?
The following code:
val myDF = Seq(83, 90, 40, 94, 12, 70, 56, 70, 28, 91).toDF("number")
myDF.orderBy("number").limit(3).show
outputs:
+------+
|number|
+------+
|    12|
|    28|
|    40|
+------+
Does Spark's laziness, in combination with the limit call and the implementation of orderBy, automatically result in a partially sorted DataFrame, or are the remaining 7 numbers also sorted even though that isn't needed? And if so, is there a way to avoid this needless computational work?
Using .explain() shows that two sort stages are performed: first on each partition, and then (with the top 3 from each) a global one. But it does not state whether these sorts are full or partial.
myDF.orderBy("number").limit(3).explain(true)
== Parsed Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3416 ASC NULLS FIRST], true
      +- Project [value#3414 AS number#3416]
         +- LocalRelation [value#3414]
== Analyzed Logical Plan ==
number: int
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3416 ASC NULLS FIRST], true
      +- Project [value#3414 AS number#3416]
         +- LocalRelation [value#3414]
== Optimized Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3416 ASC NULLS FIRST], true
      +- LocalRelation [number#3416]
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[number#3416 ASC NULLS FIRST], output=[number#3416])
+- LocalTableScan [number#3416]
If you explain() your DataFrame, you'll find that Spark first does a "local" sort within each partition, and then picks only the top three elements from each for a final global sort, before taking the top three out of that.
scala> myDF.orderBy("number").limit(3).explain(true)
== Parsed Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3 ASC NULLS FIRST], true
      +- Project [value#1 AS number#3]
         +- LocalRelation [value#1]
== Analyzed Logical Plan ==
number: int
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3 ASC NULLS FIRST], true
      +- Project [value#1 AS number#3]
         +- LocalRelation [value#1]
== Optimized Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3 ASC NULLS FIRST], true
      +- LocalRelation [number#3]
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[number#3 ASC NULLS FIRST], output=[number#3])
+- LocalTableScan [number#3]
I think it's best seen in the Optimized Logical Plan section, but the Physical Plan says the same thing.
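To make the two-stage idea concrete, here is a minimal sketch in plain Scala, with no Spark dependency, of a per-partition top-k followed by a global top-k over the surviving candidates. The names PartialSortSketch, topK and takeOrdered, and the split of the question's numbers into three partitions, are assumptions made up for illustration. This is not Spark's implementation of TakeOrderedAndProject; it only shows that a "top 3 per partition, then top 3 overall" plan can be carried out without fully sorting any partition.

```scala
import scala.collection.mutable

object PartialSortSketch {
  // Keep only the k smallest elements of one partition using a bounded max-heap.
  // This costs O(n log k) instead of the O(n log n) of a full sort.
  def topK(partition: Iterator[Int], k: Int): Seq[Int] = {
    // PriorityQueue is a max-heap by default, so head is the largest retained value.
    val heap = mutable.PriorityQueue.empty[Int]
    partition.foreach { x =>
      if (heap.size < k) heap.enqueue(x)
      else if (k > 0 && x < heap.head) {
        heap.dequeue() // evict the current largest candidate
        heap.enqueue(x)
      }
    }
    heap.toSeq.sorted // only the k retained values are ever sorted
  }

  // Stage 1: top k within each partition. Stage 2: top k of the merged candidates.
  def takeOrdered(partitions: Seq[Seq[Int]], k: Int): Seq[Int] =
    topK(partitions.iterator.flatMap(p => topK(p.iterator, k)), k)

  def main(args: Array[String]): Unit = {
    // The question's numbers, split into three hypothetical partitions.
    val partitions = Seq(Seq(83, 90, 40, 94), Seq(12, 70, 56), Seq(70, 28, 91))
    println(takeOrdered(partitions, 3).mkString(", ")) // prints: 12, 28, 40
  }
}
```

With this approach the second stage only ever sees at most k values per partition, so the remaining numbers are compared against the heap but never sorted among themselves.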
1 => will do a full sort and then pick the first 3 elements.
2 => will return a DataFrame with the first 3 elements and sort them.