In Foundry, how can I Hive partition with only 1 parquet file per value?
I'm looking to improve the performance of my filtering logic. To accomplish this, the idea is to use Hive partitioning, setting the partition column to a column in the dataset (called splittable_column).
I checked that the cardinality of splittable_column is low, and if I subset the data to each value of splittable_column, the end result is an 800MB parquet file per value.
If the cardinality of my dataset is 3, my goal is to have the data laid out like:
spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet
When I run my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"]) and look at the results, I see many files in the KB range within each directory, which is going to cause a large overhead during reading. For example, my dataset looks like:
spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet
spark/splittable_column=Value A/part-00001-abc.c000.snappy.parquet
spark/splittable_column=Value A/part-00002-abc.c000.snappy.parquet
...
spark/splittable_column=Value A/part-00033-abc.c000.snappy.parquet
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet
...
spark/splittable_column=Value B/part-00030-def.c000.snappy.parquet
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet
...
spark/splittable_column=Value C/part-00032-ghi.c000.snappy.parquet
etc.
From the documentation I understand that:
you will have at least one output file for each unique value in your partition column
How do I configure the transform so that I get at most one output file per partition value during Hive partitioning?
If you look at the input data, you may notice that it is split across multiple parquet files. When you look at the build report for just running my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"]), you may notice that there is no shuffle in the query plan.
That is, you would see:
Graph:
Scan
Project
BasicStats
Execute
Plan:
FoundrySaveDatasetCommand `ri.foundry.main.transaction.xxx@master`.`ri.foundry.main.dataset.yyy`, ErrorIfExists, [column1 ... 17 more fields],
+- BasicStatsNode `ri.foundry.main.transaction.zzz@master`.`ri.foundry.main.dataset.aaa`
+- Project [splitable_column ... 17 more fields]
+- Relation !ri.foundry.main.transaction.xxx:master.ri.foundry.main.dataset.yyy[splittable_column... 17 more fields] parquet
In this example, it only took 1 minute to run because there was no shuffle.
Now, if you repartition on the column you are going to partition by:
df_with_logic = df_with_logic.repartition("splittable_column")
my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"])
it will force an Exchange, i.e. a RepartitionByExpression on splittable_column. This will take longer (15 minutes in my case), but the data will be split the way I wanted:
spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet
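Putting the two lines together, a complete transform might look like the following sketch (the Input/Output paths are hypothetical placeholders, and I'm assuming the standard transforms.api decorator style):

```python
from transforms.api import transform, Input, Output

@transform(
    my_output_df=Output("/Project/folder/output_dataset"),  # hypothetical path
    my_input_df=Input("/Project/folder/input_dataset"),     # hypothetical path
)
def compute(my_output_df, my_input_df):
    df_with_logic = my_input_df.dataframe()
    # ... filtering logic here ...
    # Shuffle so that all rows for a given value land in a single task,
    # which yields one parquet file per partition value on write.
    df_with_logic = df_with_logic.repartition("splittable_column")
    my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"])
```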
Graph:
Scan
Exchange
Project
BasicStats
Execute
Plan:
FoundrySaveDatasetCommand `ri.foundry.main.transaction.xxx@master`.`ri.foundry.main.dataset.yyy`, ErrorIfExists, [column1 ... 17 more fields],
+- BasicStatsNode `ri.foundry.main.transaction.zzz@master`.`ri.foundry.main.dataset.aaa`
+- Project [splitable_column ... 17 more fields]
+- RepartitionByExpression [splittable_column], 1
+- Relation !ri.foundry.main.transaction.xxx:master.ri.foundry.main.dataset.yyy[splittable_column... 17 more fields] parquet
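The file-count behaviour above can be sketched without Spark: on write, each task emits one file for every partition value it holds, so a value that appears in many tasks produces many small files. A minimal plain-Python model (the 8-task split and the modulo hash partitioner are illustrative assumptions, standing in for input splits and Spark's hash shuffle):

```python
from collections import defaultdict

# 300 rows across 3 partition values, mirroring the example dataset.
rows = ([("Value A", i) for i in range(100)]
        + [("Value B", i) for i in range(100)]
        + [("Value C", i) for i in range(100)])

def write_files(tasks):
    """Each task writes one file per distinct partition value it contains."""
    files = defaultdict(int)
    for task_rows in tasks:
        for value in {r[0] for r in task_rows}:
            files[value] += 1
    return files

# Without repartition: rows are spread over many input splits (8 tasks here),
# and every task holds all three values -> 8 files per value.
input_splits = [rows[i::8] for i in range(8)]
print(dict(write_files(input_splits)))

# With df.repartition("splittable_column"): rows are hashed by the column,
# so all rows for a value land in one task -> 1 file per value.
# 200 stands in for spark.sql.shuffle.partitions.
shuffled = defaultdict(list)
for r in rows:
    shuffled[hash(r[0]) % 200].append(r)
print(dict(write_files(shuffled.values())))
```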