Spark DataFrame orderBy vs. DataFrameWriter sortBy: is there a difference?

Question

Is there a difference in the output between sorting before or after the .write call on a DataFrame?

val people: Dataset[Person]  // DataFrame takes no type parameter; Dataset[Person] is the typed form

people
        .orderBy("name")
        .write
        .mode(SaveMode.Append)
        .format("parquet")
        .saveAsTable("test_segments")

and

val people: Dataset[Person]  // DataFrame takes no type parameter; Dataset[Person] is the typed form

people
        .write
        .sortBy("name")
        .mode(SaveMode.Append)
        .format("parquet")
        .saveAsTable("test_segments")

Answer 1

The difference between the two is explained in the comments in the Spark source code:

orderBy: A Dataset/DataFrame operation. Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
sortBy: A DataFrameWriter operation. Sorts the output in each bucket by the given columns.

The sortBy method only works when you also define buckets (bucketBy). Otherwise you get an exception:

if (sortColumnNames.isDefined && numBuckets.isEmpty) {
  throw new AnalysisException("sortBy must be used together with bucketBy")
}
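To make the contrast concrete, here is a minimal sketch that writes the same data both ways. The Person fields, the sample rows, the table names, the bucket count of 4, and the local SparkSession settings are all assumptions made for illustration. It uses SaveMode.Overwrite instead of the question's Append only so that it can be re-run.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.util.Try

// Hypothetical record type standing in for the question's Person.
case class Person(name: String, age: Int)

object SortByVsOrderBy {
  // Returns true if sortBy without bucketBy was rejected.
  def demo(spark: SparkSession): Boolean = {
    import spark.implicits._
    val people = Seq(Person("Carol", 41), Person("Alice", 29), Person("Bob", 35)).toDS()

    // orderBy: a Dataset operation; the whole Dataset is sorted before it is written.
    people.orderBy("name")
      .write.mode(SaveMode.Overwrite).format("parquet")
      .saveAsTable("people_ordered")

    // sortBy: a DataFrameWriter operation; rows are sorted within each bucket only.
    people.write
      .bucketBy(4, "name")
      .sortBy("name")
      .mode(SaveMode.Overwrite).format("parquet")
      .saveAsTable("people_bucketed")

    // sortBy without bucketBy should hit the check quoted above.
    Try {
      people.write.sortBy("name")
        .mode(SaveMode.Overwrite).format("parquet")
        .saveAsTable("people_sorted_only")
    }.isFailure
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sortBy-vs-orderBy")
      .getOrCreate()
    println(s"sortBy without bucketBy rejected: ${demo(spark)}")
    spark.stop()
  }
}
```

Separate table names are used because appending bucketed output to an existing non-bucketed table is likely to be rejected as a bucketing mismatch.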

The columns passed to sortBy end up in the BucketSpec as sortColumnNames, as shown below:

Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns that used to generate the bucket id.
sortColumnNames – the names of the columns that used to sort data in each bucket.

case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])
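The way bucketBy and sortBy feed into a BucketSpec can be sketched without Spark on the classpath. The following is a simplified, standalone model and not Spark's actual DataFrameWriter: WriterState and its methods are hypothetical names, BucketSpec is redeclared locally with the fields quoted above, and IllegalArgumentException stands in for Spark's AnalysisException.

```scala
// Local stand-in with the same fields as the BucketSpec quoted above.
final case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])

// Hypothetical model of the options a writer collects before saving.
final case class WriterState(
    numBuckets: Option[Int] = None,
    bucketColumnNames: Option[Seq[String]] = None,
    sortColumnNames: Option[Seq[String]] = None) {

  def bucketBy(n: Int, col: String, cols: String*): WriterState =
    copy(numBuckets = Some(n), bucketColumnNames = Some(col +: cols))

  def sortBy(col: String, cols: String*): WriterState =
    copy(sortColumnNames = Some(col +: cols))

  // Mirrors the quoted check: sort columns are only meaningful with buckets.
  def bucketSpec: Option[BucketSpec] = {
    if (sortColumnNames.isDefined && numBuckets.isEmpty) {
      throw new IllegalArgumentException("sortBy must be used together with bucketBy")
    }
    numBuckets.map { n =>
      BucketSpec(n, bucketColumnNames.getOrElse(Nil), sortColumnNames.getOrElse(Nil))
    }
  }
}
```

In this model, WriterState().bucketBy(4, "name").sortBy("name").bucketSpec yields a BucketSpec with "name" as both the bucket column and the sort column, while WriterState().sortBy("name").bucketSpec throws.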

Answer 1 (3 votes, accepted), 2021-02-18 06:29:03
