
How do I ensure consistent file sizes in datasets built in Foundry Python Transforms?

My Foundry transform produces a different amount of data on each run, but I want each output file to contain a similar number of rows. I can call DataFrame.count() and then coalesce/repartition, but that requires computing the full dataset and then either caching it or recomputing it. Is there a way to have Spark take care of this?

You can use the spark.sql.files.maxRecordsPerFile configuration option by setting it per output of @transform:

output.write_dataframe(
    output_df,
    # Spark closes each output file once it reaches this many records.
    options={"maxRecordsPerFile": "1000000"},
)
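For context, here is a minimal sketch of how that option might be applied inside a complete Python transform, assuming the standard transforms.api decorator style; the dataset paths are placeholders:

from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/folder/output_dataset"),    # placeholder path
    source=Input("/Project/folder/input_dataset"),   # placeholder path
)
def compute(out, source):
    df = source.dataframe()
    # Cap each written file at roughly one million records.
    out.write_dataframe(df, options={"maxRecordsPerFile": "1000000"})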

proggeo's answer is useful if the only thing you care about is the number of records per file. However, it is sometimes useful to bucket your data so that Foundry can optimize downstream operations such as Contour analyses or other transforms.

In those cases you can use something like:

bucket_column = 'equipment_number'
num_files = 8

# Repartition so the in-memory layout matches the bucket spec passed to the write.
output_df = output_df.repartition(num_files, bucket_column)

output.write_dataframe(
    output_df,
    bucket_cols=[bucket_column],
    bucket_count=num_files,
)

If your bucket column is well distributed, this will keep a similar number of rows in each dataset file.
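If you are unsure whether the bucket column is evenly distributed, a quick, purely illustrative PySpark check of the per-bucket row counts could look like this:

# Illustrative check: row counts per bucket value, largest first.
bucket_counts = output_df.groupBy(bucket_column).count()
bucket_counts.orderBy("count", ascending=False).show(20)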

