Spark DataFrame orderBy and DataFrameWriter sortBy, is there a difference?
Is there any difference in the output between sorting before and sorting after the .write call on a DataFrame?
val people: Dataset[Person] // DataFrame takes no type parameter; a typed collection is a Dataset[Person]
people
.orderBy("name")
.write
.mode(SaveMode.Append)
.format("parquet")
.saveAsTable("test_segments")
and
val people: Dataset[Person] // same Dataset as in the first snippet
people
.write
.sortBy("name")
.mode(SaveMode.Append)
.format("parquet")
.saveAsTable("test_segments")
The comments in the code explain the difference between the two.
The sortBy method only works when you also define buckets (bucketBy). Otherwise you get an exception:
if (sortColumnNames.isDefined && numBuckets.isEmpty) {
throw new AnalysisException("sortBy must be used together with bucketBy")
}
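To make the check above concrete, here is a minimal sketch of both writer calls, assuming a local SparkSession with Spark SQL on the classpath. The Person fields, the table names, the bucket count of 4 and the choice of "age" as the bucket column are all hypothetical, picked only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import scala.util.Try

object SortByDemo {
  // Hypothetical record type standing in for the question's Person.
  final case class Person(name: String, age: Int)

  // sortBy combined with bucketBy: rows are sorted by name inside each bucket.
  def writeBucketed(people: Dataset[Person]): Unit =
    people.write
      .bucketBy(4, "age") // assumed bucket count and bucket column
      .sortBy("name")
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .saveAsTable("people_bucketed")

  // sortBy without bucketBy: expected to fail with the exception quoted above,
  // so the call is wrapped in Try and the result is returned to the caller.
  def writeUnbucketed(people: Dataset[Person]): Try[Unit] =
    Try {
      people.write
        .sortBy("name")
        .mode(SaveMode.Overwrite)
        .format("parquet")
        .saveAsTable("people_unbucketed")
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sortBy-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Carol", 41), Person("Alice", 29), Person("Bob", 35)).toDS()

    writeBucketed(people)
    println(writeUnbucketed(people)) // expected to be a Failure wrapping the exception

    spark.stop()
  }
}
```

By contrast, orderBy is a Dataset transformation, so people.orderBy("name").write... sorts the rows before they reach the writer and needs no bucketBy.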
The columns defined in sortBy are used as sortColumnNames in the BucketSpec, as shown below:
Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns that used to generate the bucket id.
sortColumnNames – the names of the columns that used to sort data in each bucket.
case class BucketSpec(
numBuckets: Int,
bucketColumnNames: Seq[String],
sortColumnNames: Seq[String])
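The phrase "sort data in each bucket" is the key difference from orderBy, which sorts the whole Dataset. The sketch below models that difference in plain Scala without Spark. It is a simplified illustration, not Spark's implementation: the Person type is hypothetical, and bucketId uses a plain modulo on age, whereas Spark computes bucket ids with its own hash of the bucket columns.

```scala
object BucketSortSketch {
  // Hypothetical record type used only for this illustration.
  final case class Person(name: String, age: Int)

  // Assumed stand-in for bucket assignment; Spark uses its own hash function.
  def bucketId(p: Person, numBuckets: Int): Int =
    Math.floorMod(p.age, numBuckets)

  // Models orderBy("name"): one total ordering over all rows.
  def globallySorted(people: Seq[Person]): Seq[Person] =
    people.sortBy(_.name)

  // Models bucketBy(numBuckets, "age").sortBy("name"):
  // rows are grouped into buckets first, then sorted only within each bucket.
  def sortedWithinBuckets(people: Seq[Person], numBuckets: Int): Map[Int, Seq[Person]] =
    people
      .groupBy(p => bucketId(p, numBuckets))
      .map { case (id, rows) => id -> rows.sortBy(_.name) }
}
```

In this model each bucket is ordered by name on its own, but reading the buckets one after another does not yield a globally sorted sequence, which is why the two calls in the question are not interchangeable.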