
Repartitioning by multiple columns for a PySpark dataframe

EDIT: adding more context to the question now that I have reread the post:

Let's say I have a PySpark dataframe that I am working with, and currently I can repartition the dataframe like this:

dataframe.repartition(200, col_name)

And I write that partitioned dataframe out to a Parquet file. When reading the directory, I see that the directory in the warehouse is partitioned the way I want:

/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2
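
For reference, a minimal sketch of what I am running now, assuming the write also uses partitionBy("col_name") (which is what produces the col_name= directories above) and the warehouse path shown:

dataframe.repartition(200, "col_name") \
    .write.mode("overwrite") \
    .partitionBy("col_name") \
    .parquet("/apps/hive/warehouse/db/DATE")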

I want to understand how I can repartition this in multiple layers, meaning I partition by one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write method?

dataframe.write.mode("overwrite") \
    .partitionBy("col_name1", "col_name2", "col_name3") \
    .parquet("/apps/hive/warehouse/db/DATE")

Thus creating directories like this?

/apps/hive/warehouse/db/DATE/col_name1=1/col_name2=1/col_name3=1

If so, can I use partitionBy() to write out a maximum number of files per partition?

Repartition

The repartition function controls the in-memory partitioning of the data. If you specify repartition(200), then you will have 200 partitions in memory.
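
You can verify the in-memory partition count directly; a quick sketch:

df = dataframe.repartition(200)
print(df.rdd.getNumPartitions())  # prints 200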

Physical partitions on the file system

The partitionBy function, given a list of columns, controls the directory structure: one physical partition (directory) is created per combination of column name and column value. Each of those directories can contain up to as many files as there are in-memory partitions (200 by default), provided you have enough data to write.

Here is a sample based on your question:

(dataframe
    .repartition(200)
    .write.mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet("/apps/hive/warehouse/db/DATE"))

This will write up to 200 files into each partition directory, and the directories will be nested in the given column order.
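
To address the last part of the question: partitionBy() itself does not cap the file count, but you can bound it from the repartition side. A sketch under two assumptions: repartitioning by the same columns as partitionBy (so each directory's rows land in a single task, hence a single file), and the maxRecordsPerFile writer option, available in Spark 2.2+, to split oversized files:

(dataframe
    .repartition("col_name1", "col_name2", "col_name3")  # one task per key combination
    .write.mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .option("maxRecordsPerFile", 1000000)  # optionally split very large files
    .parquet("/apps/hive/warehouse/db/DATE"))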
