
Spark DataFrame orderBy and DataFrameWriter sortBy, is there a difference?

Is there a difference in the output between sorting before and sorting after the .write call on a DataFrame?

val people: Dataset[Person] = ???  // placeholder, loaded elsewhere (DataFrame itself takes no type parameter)

people
        .orderBy("name")
        .write
        .mode(SaveMode.Append)
        .format("parquet")
        .saveAsTable("test_segments") 

and

val people: Dataset[Person] = ???  // placeholder, loaded elsewhere (DataFrame itself takes no type parameter)

people
        .write
        .sortBy("name")
        .mode(SaveMode.Append)
        .format("parquet")
        .saveAsTable("test_segments") 

The difference between the two is explained in the comments in the Spark source code:

  • orderBy: a Dataset/DataFrame operation. Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
  • sortBy: a DataFrameWriter operation. Sorts the output in each bucket by the given columns.

The sortBy method only works when you also define buckets (bucketBy). Otherwise you get an exception:

if (sortColumnNames.isDefined && numBuckets.isEmpty) {
  throw new AnalysisException("sortBy must be used together with bucketBy")
}
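As a minimal sketch of a write that passes this check, the second snippet from the question can be combined with bucketBy. The object name BucketedWrite, the function writeSortedBuckets, its table parameter, and the bucket count of 4 are assumptions chosen for illustration; they do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object BucketedWrite {
  // Hypothetical helper: writes `people` as a bucketed table whose rows are
  // sorted by "name" inside each bucket (not globally).
  def writeSortedBuckets(people: DataFrame, table: String): Unit =
    people.write
      .bucketBy(4, "name")    // 4 buckets is an arbitrary choice for this sketch
      .sortBy("name")         // allowed only because bucketBy is also set
      .mode(SaveMode.Append)
      .format("parquet")
      .saveAsTable(table)
}
```

Calling BucketedWrite.writeSortedBuckets(people.toDF(), "test_segments") would mirror the question's write, with the ordering applied per bucket by the writer rather than by orderBy on the Dataset.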

The columns passed to sortBy are used in the BucketSpec as sortColumnNames, as shown below:

Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns that used to generate the bucket id.
sortColumnNames – the names of the columns that used to sort data in each bucket.

case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])
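To make "sort data in each bucket" concrete, here is a plain-Scala sketch that contrasts one global ordering (what orderBy produces) with an ordering applied separately inside each bucket (what bucketBy plus sortBy describes). It is a simplified model that does not use Spark: the object BucketSortSketch and its functions are hypothetical, and the bucket id is derived from hashCode purely for illustration, whereas Spark uses its own hash function.

```scala
object BucketSortSketch {
  // Hypothetical bucket assignment for illustration only; Spark's real
  // bucketing uses a different hash function.
  def bucketId(key: String, numBuckets: Int): Int =
    Math.floorMod(key.hashCode, numBuckets)

  // Analogue of orderBy: one total ordering across all rows.
  def globalSort(names: Seq[String]): Seq[String] =
    names.sorted

  // Analogue of bucketBy + sortBy: rows are grouped into buckets first,
  // then each bucket is sorted independently of the others.
  def sortWithinBuckets(names: Seq[String], numBuckets: Int): Map[Int, Seq[String]] =
    names
      .groupBy(name => bucketId(name, numBuckets))
      .map { case (id, rows) => id -> rows.sorted }
}
```

In this model, every bucket returned by sortWithinBuckets is ordered internally, but concatenating the buckets does not generally give the same sequence as globalSort.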
