
Spark DataFrame partitioner is None

[New to Spark] After creating a DataFrame I am trying to partition it based on a column in the DataFrame. When I check the partitioner using data_frame.rdd.partitioner I get None as output.

Partitioning using:

data_frame.repartition("column_name")
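For example, a minimal spark-shell sketch (assuming a SparkSession named spark; the column name is only illustrative) reproduces the observation:

scala> val data_frame = spark.range(10).toDF("column_name").repartition($"column_name")

scala> data_frame.rdd.partitioner
res0: Option[org.apache.spark.Partitioner] = None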

As per the Spark documentation the default partitioner is HashPartitioner; how can I confirm that?

Also, how can I change the partitioner?

That's to be expected. An RDD converted from a Dataset doesn't preserve the partitioner, only the data distribution.

If you want to inspect the partitioner of the RDD, you should retrieve it from the queryExecution:

scala> val df = spark.range(100).select($"id" % 3 as "id").repartition(42, $"id")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]

scala> df.queryExecution.toRdd.partitioner
res1: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@4be2340e)
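(Here queryExecution.toRdd exposes the physical plan's RDD[InternalRow], which is why the partitioning information is still visible there, unlike with data_frame.rdd.)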

how can I change the partitioner?

In general you cannot. There is a repartitionByRange method (see the linked thread), but otherwise the Dataset partitioner is not configurable.
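As a minimal sketch of the range-based alternative (spark-shell, Spark 2.3+; column names are only illustrative):

scala> val df = spark.range(100).select($"id" % 3 as "id")

scala> val byRange = df.repartitionByRange(42, $"id")  // range partitioning on id

scala> byRange.explain  // the physical plan should show a range-partitioning exchange instead of hash partitioning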
