Spark DataFrame orderBy and DataFrameWriter sortBy, is there a difference?
Is there a difference in the output between sorting before or after the .write command on a DataFrame?
val people: Dataset[Person]
people
.orderBy("name")
.write
.mode(SaveMode.Append)
.format("parquet")
.saveAsTable("test_segments")
and
val people: Dataset[Person]
people
.write
.sortBy("name")
.mode(SaveMode.Append)
.format("parquet")
.saveAsTable("test_segments")
The difference between the two is explained in the comments within the code:
The sortBy method will only work when you are also defining buckets (bucketBy). Otherwise you will get an exception:
if (sortColumnNames.isDefined && numBuckets.isEmpty) {
throw new AnalysisException("sortBy must be used together with bucketBy")
}
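That check can be mimicked in plain Scala. The sketch below is a minimal stand-in for Spark's internal writer state, not the real API (the names WriterState and validate are invented for illustration); the fix on the Spark side is simply to pair sortBy with bucketBy, e.g. people.write.bucketBy(4, "name").sortBy("name").

```scala
// Minimal sketch of the validation quoted above. WriterState is an
// invented stand-in for the DataFrameWriter's internal state, and
// IllegalArgumentException replaces Spark's AnalysisException.
object SortByCheck {
  final case class WriterState(
      numBuckets: Option[Int] = None,
      sortColumnNames: Option[Seq[String]] = None)

  def validate(state: WriterState): Unit =
    if (state.sortColumnNames.isDefined && state.numBuckets.isEmpty)
      throw new IllegalArgumentException(
        "sortBy must be used together with bucketBy")
}
```

Calling SortByCheck.validate with sort columns but no bucket count throws, mirroring the behaviour of the second snippet in the question; supplying both passes.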
The columns defined in sortBy are used in the BucketSpec as sortColumnNames, as shown below:
Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns used to generate the bucket id.
sortColumnNames – the names of the columns used to sort data in each bucket.
case class BucketSpec(
numBuckets: Int,
bucketColumnNames: Seq[String],
sortColumnNames: Seq[String])