
Repartitioning by multiple columns for a PySpark dataframe

EDIT: adding more context to the question after rereading the post:

Let's say I have a PySpark dataframe that I am working with, and currently I can repartition it like this:

dataframe.repartition(200, col_name)

I then write that partitioned dataframe out to a Parquet file. When I inspect the warehouse directory, I see that it is partitioned the way I want:

/apps/hive/warehouse/db/DATE/col_name=1
/apps/hive/warehouse/db/DATE/col_name=2
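
A minimal sketch of the kind of write that would produce this layout (the source path and output path are placeholders, and the Parquet write with partitionBy is an assumption based on the description above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source; stands in for the dataframe described above.
dataframe = spark.read.parquet("/apps/hive/warehouse/db/source")

# Repartition in memory by col_name, then write one directory per col_name value.
(dataframe
    .repartition(200, "col_name")
    .write
    .mode("overwrite")
    .partitionBy("col_name")
    .parquet("/apps/hive/warehouse/db/DATE"))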

I want to understand how I can partition this in multiple layers, meaning one column for the top-level partition, a second column for the second-level partition, and a third column for the third-level partition. Is it as easy as adding a partitionBy() to the write?

dataframe.write.mode("overwrite").partitionBy("col_name1", "col_name2", "col_name3")

Thus creating directories like this?

/apps/hive/warehouse/db/DATE/col_name1=1/
/apps/hive/warehouse/db/DATE/col_name1=1/col_name2=1/
/apps/hive/warehouse/db/DATE/col_name1=1/col_name2=1/col_name3=1/

If so, can I use partitionBy() to write out a maximum number of files per partition?

Repartition

The repartition function controls the in-memory partitioning of the data. If you specify repartition(200), then you will have 200 partitions in memory.
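
A small sketch (the dataframe and column name are placeholders) showing that repartition changes only the number of in-memory partitions, which you can check with getNumPartitions:

# repartition changes the in-memory partitioning, not the directory layout.
repartitioned = dataframe.repartition(200)
print(repartitioned.rdd.getNumPartitions())  # 200

# Repartitioning by a column still targets the given number of partitions,
# but co-locates rows with the same column value in the same partition.
by_column = dataframe.repartition(200, "col_name1")
print(by_column.rdd.getNumPartitions())  # 200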

Physical partitions on the file system

The partitionBy function, given a list of columns, controls the directory structure. Physical partitions are created on disk for each combination of column name and column value. Each partition directory can contain up to as many files as there are in-memory partitions (200 in this example, which is also the default), provided you have enough data to write.

Here is a sample based on your question:

(dataframe
    .repartition(200)
    .write
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet(output_path))  # output_path is a placeholder for the target directory

This will give up to 200 files in each partition directory, and the partition directories will be nested in the given column order.
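
To directly address the follow-up about capping the number of files per partition: one common approach (a sketch under stated assumptions, not part of the original answer) is to repartition by the same columns passed to partitionBy, so each output directory receives a single file. The maxRecordsPerFile writer option (available since Spark 2.2) is an alternative that caps files by row count; output_path is a placeholder.

# One file per output directory: all rows for a given combination of
# col_name1/col_name2/col_name3 land in a single in-memory partition.
(dataframe
    .repartition("col_name1", "col_name2", "col_name3")
    .write
    .mode("overwrite")
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet(output_path))

# Alternative: cap files by row count instead of file count (Spark 2.2+).
(dataframe
    .write
    .mode("overwrite")
    .option("maxRecordsPerFile", 1000000)
    .partitionBy("col_name1", "col_name2", "col_name3")
    .parquet(output_path))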
