
See information of partitions of a Spark DataFrame

One can get an array of the partitions of a Spark DataFrame as follows:

> df.rdd.partitions

Is there a way to get more information about the partitions? In particular, I would like to see the partition key and the partition boundaries (the first and last element within each partition).

This is just for a better understanding of how the data is organized.

This is what I tried:

> df.rdd.partitions.head

But this object only has the attributes and methods `equals`, `hashCode` and `index`.
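A `Partition` object indeed carries no row data, only its index. One way to see the actual boundary rows is to go through `mapPartitionsWithIndex` on the underlying RDD. A minimal sketch, assuming a local `SparkSession` and an illustrative range-partitioned DataFrame (the example data and partition count are made up for demonstration):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("partition-bounds")
  .getOrCreate()
import spark.implicits._

// Example data: ids 0..99, range-partitioned into 4 partitions by "id".
val df = spark.range(0, 100).toDF("id").repartitionByRange(4, $"id")

// Collect (partition index, first row, last row) for every non-empty partition.
val bounds = df.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    if (rows.hasNext) {
      val seq = rows.toVector
      Iterator((idx, seq.head, seq.last))
    } else Iterator.empty
  }
  .collect()
  .sortBy(_._1)

bounds.foreach { case (idx, first, last) =>
  println(s"partition $idx: first=$first, last=$last")
}
```

Because `repartitionByRange` sorts rows into contiguous ranges, the printed first/last rows are exactly the partition boundaries the question asks about. Note that collecting whole partitions to the driver is only sensible for small data.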

In case the data is not too large, one can write it to disk as follows:

df.write.option("header", "true").csv("/tmp/foobar")

The given directory must not exist.
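If the target directory may already exist, `mode("overwrite")` on the writer avoids that failure. Spark writes one `part-*.csv` file per partition, so listing the output directory also shows how the rows were split up. A small self-contained sketch (the example DataFrame and the partition count of 3 are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("csv-per-partition")
  .getOrCreate()
import spark.implicits._

// Small example DataFrame, spread across 3 partitions.
val df = spark.range(0, 30).toDF("id").repartition(3)

// mode("overwrite") replaces /tmp/foobar if it already exists;
// without it, writing to an existing directory throws an AnalysisException.
df.write.mode("overwrite").option("header", "true").csv("/tmp/foobar")

// One part-*.csv file is produced per partition.
val parts = new java.io.File("/tmp/foobar")
  .listFiles
  .filter(_.getName.endsWith(".csv"))
println(s"wrote ${parts.length} partition files")
```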
