
In Foundry, how can I Hive partition with only 1 parquet file per value?

I'm looking to improve the performance of running filtering logic. To accomplish this, the idea is to apply Hive partitioning by setting the partition column to a column in the dataset (called splittable_column).

I checked and the cardinality of splittable_column is low, and if I subset each value of splittable_column, the end result is an 800 MB parquet file.

If the cardinality of splittable_column is 3, my goal is to have the data laid out like:

spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet  
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet  
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet  

When I run my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"]) and look at the results, I see many files in the KB range within each directory, which will cause a large overhead during reading. For example, my dataset looks like:

spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet
spark/splittable_column=Value A/part-00001-abc.c000.snappy.parquet  
spark/splittable_column=Value A/part-00002-abc.c000.snappy.parquet  
...
spark/splittable_column=Value A/part-00033-abc.c000.snappy.parquet  
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet
...
spark/splittable_column=Value B/part-00030-def.c000.snappy.parquet
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet
...
spark/splittable_column=Value C/part-00032-ghi.c000.snappy.parquet
etc.

From the documentation I understand that:

you will have at least one output file for each unique value in your partition column

How do I configure the transform so that I get at most 1 output file per partition value during Hive partitioning?
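For context, the transform is roughly shaped like this (a simplified sketch: the dataset paths and the filtering logic are placeholders, not the real ones):

from transforms.api import transform, Input, Output

@transform(
    my_output_df=Output("/path/to/output_dataset"),  # placeholder path
    my_input_df=Input("/path/to/input_dataset"),     # placeholder path
)
def compute(my_output_df, my_input_df):
    df_with_logic = my_input_df.dataframe()
    # ... filtering logic on df_with_logic goes here ...

    # Hive-style partitioning on splittable_column
    my_output_df.write_dataframe(
        df_with_logic,
        partition_cols=["splittable_column"],
    )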

If you look at the input data, you may notice that the data is split across multiple parquet files. When you look at the build report for just running my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"]), you may notice that there is no shuffle in the query plan.

I.e., you would see:

Graph:

Scan
Project
BasicStats
Execute

Plan:

FoundrySaveDatasetCommand `ri.foundry.main.transaction.xxx@master`.`ri.foundry.main.dataset.yyy`, ErrorIfExists, [column1 ... 17 more fields],
+- BasicStatsNode `ri.foundry.main.transaction.zzz@master`.`ri.foundry.main.dataset.aaa`
   +- Project [splitable_column ... 17 more fields]
      +- Relation !ri.foundry.main.transaction.xxx:master.ri.foundry.main.dataset.yyy[splittable_column... 17 more fields] parquet

In this example, it only took 1 minute to run because there was no shuffle.
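Because there is no Exchange, each upstream task writes its own files into every partition-value directory it touches, which is why dozens of small files appear per value. A quick illustrative check (not from the original post) is to print the partition count before the write; it is roughly the upper bound on files per value:

print(df_with_logic.rdd.getNumPartitions())  # approx. max number of files written per partition value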

Now, if you repartition on the column you are going to partition by:

df_with_logic = df_with_logic.repartition("splittable_column")
my_output_df.write_dataframe(df_with_logic, partition_cols=["splittable_column"])

It will force an Exchange, i.e. a RepartitionByExpression on splittable_column, which will take longer (15 minutes in my case), but the data will be split the way I wanted:

spark/splittable_column=Value A/part-00000-abc.c000.snappy.parquet  
spark/splittable_column=Value B/part-00000-def.c000.snappy.parquet  
spark/splittable_column=Value C/part-00000-ghi.c000.snappy.parquet  

Graph:

Scan
Exchange
Project
BasicStats
Execute

Plan:

FoundrySaveDatasetCommand `ri.foundry.main.transaction.xxx@master`.`ri.foundry.main.dataset.yyy`, ErrorIfExists, [column1 ... 17 more fields],
+- BasicStatsNode `ri.foundry.main.transaction.zzz@master`.`ri.foundry.main.dataset.aaa`
   +- Project [splitable_column ... 17 more fields]
      +- RepartitionByExpression [splittable_column], 1
          +- Relation !ri.foundry.main.transaction.xxx:master.ri.foundry.main.dataset.yyy[splittable_column... 17 more fields] parquet
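Putting it together, a minimal sketch of the full transform with the repartition added (again, dataset paths and the upstream logic are placeholders):

from transforms.api import transform, Input, Output

@transform(
    my_output_df=Output("/path/to/output_dataset"),  # placeholder path
    my_input_df=Input("/path/to/input_dataset"),     # placeholder path
)
def compute(my_output_df, my_input_df):
    df_with_logic = my_input_df.dataframe()
    # ... filtering / transformation logic goes here ...

    # Shuffle so that all rows with the same splittable_column value land in the
    # same task, which yields one parquet file per Hive partition directory
    df_with_logic = df_with_logic.repartition("splittable_column")

    my_output_df.write_dataframe(
        df_with_logic,
        partition_cols=["splittable_column"],
    )

Since repartition hash-partitions by the column, all rows sharing a splittable_column value end up in a single task, which is why each partition directory ends up with a single parquet file; the added shuffle is the cost of that layout.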
