
See information of partitions of a Spark DataFrame

One can get an array of the partitions of a Spark DataFrame as follows:

> df.rdd.partitions

Is there a way to get more information about the partitions? In particular, I would like to see the partition key and the partition boundaries (the first and last element within each partition).

This is just for a better understanding of how the data is organized.

This is what I tried:

> df.rdd.partitions.head

But this object only has the attributes and methods `equals`, `hashCode` and `index`.
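A `Partition` object indeed carries no row data, only its index. One way to see the actual boundary rows is to go through `mapPartitionsWithIndex` on the underlying RDD. A minimal sketch, assuming a local `SparkSession` and an illustrative range-partitioned DataFrame (the example data and partition count are made up for demonstration):

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("partition-bounds")
  .getOrCreate()
import spark.implicits._

// Example data: ids 0..99, range-partitioned into 4 partitions by "id".
val df = spark.range(0, 100).toDF("id").repartitionByRange(4, $"id")

// Collect (partition index, first row, last row) for every non-empty partition.
val bounds = df.rdd
  .mapPartitionsWithIndex { (idx, rows) =>
    if (rows.hasNext) {
      val seq = rows.toVector
      Iterator((idx, seq.head, seq.last))
    } else Iterator.empty
  }
  .collect()
  .sortBy(_._1)

bounds.foreach { case (idx, first, last) =>
  println(s"partition $idx: first=$first, last=$last")
}
```

Because `repartitionByRange` sorts rows into contiguous ranges, the printed first/last rows are exactly the partition boundaries the question asks about. Note that collecting whole partitions to the driver is only sensible for small data.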

In case the data is not too large, one can write it to disk as follows:

df.write.option("header", "true").csv("/tmp/foobar")

The given directory must not exist.
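If the target directory may already exist, `mode("overwrite")` on the writer avoids that failure. Spark writes one `part-*.csv` file per partition, so listing the output directory also shows how the rows were split up. A small self-contained sketch (the example DataFrame and the partition count of 3 are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("csv-per-partition")
  .getOrCreate()
import spark.implicits._

// Small example DataFrame, spread across 3 partitions.
val df = spark.range(0, 30).toDF("id").repartition(3)

// mode("overwrite") replaces /tmp/foobar if it already exists;
// without it, writing to an existing directory throws an AnalysisException.
df.write.mode("overwrite").option("header", "true").csv("/tmp/foobar")

// One part-*.csv file is produced per partition.
val parts = new java.io.File("/tmp/foobar")
  .listFiles
  .filter(_.getName.endsWith(".csv"))
println(s"wrote ${parts.length} partition files")
```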
