Spark DataFrame partitioner is None
[New to Spark] After creating a DataFrame I am trying to partition it based on a column in the DataFrame. When I check the partitioner using data_frame.rdd.partitioner I get None as output.
Partitioning using:

data_frame.repartition("column_name")
As per the Spark documentation the default partitioner is HashPartitioner; how can I confirm that?
Also, how can I change the partitioner?
That's to be expected. An RDD converted from a Dataset doesn't preserve the partitioner, only the data distribution.
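You can see this directly in spark-shell (a minimal sketch; spark.range and the id column are just illustrative data, and the outputs shown are representative):

scala> val df = spark.range(100).repartition(42, $"id")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.rdd.partitioner
res0: Option[org.apache.spark.Partitioner] = None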
If you want to inspect the partitioner of the underlying RDD you should retrieve it from the queryExecution:
scala> val df = spark.range(100).select($"id" % 3 as "id").repartition(42, $"id")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]
scala> df.queryExecution.toRdd.partitioner
res1: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@4be2340e)
how can I change the partitioner?
In general you cannot. There is a repartitionByRange method (see the linked thread), but otherwise the Dataset Partitioner is not configurable.
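For reference, a minimal sketch of repartitionByRange (the column and partition count are illustrative); it gives you a range-based data distribution, though the public rdd view still won't expose a Partitioner:

scala> val ranged = spark.range(100).repartitionByRange(10, $"id")
ranged: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> ranged.rdd.getNumPartitions
res2: Int = 10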