
AWS Glue - prevent empty exports to S3

How can I prevent AWS Glue from writing empty objects to S3?

I have a Glue job that writes the resulting dynamic frame to S3:

dynamic_frame = ...  # result of Glue job processing

glue_context.write_dynamic_frame.from_options(
    frame = dynamic_frame,
    connection_type = 's3',
    connection_options = {'path': 's3://some-bucket/some-path'},
    format = 'json')

However, when I check the bucket content in S3, I see not just the data but also many objects that have a size of 0 B. How can I prevent this?

I have tried using the DropNullFields class (see below), but that did not help.

dynamic_frame = ...  # result of Glue job processing

non_null_fields = DropNullFields.apply(dynamic_frame)

glue_context.write_dynamic_frame.from_options(
    frame = non_null_fields,
    connection_type = 's3',
    connection_options = {'path': 's3://some-bucket/some-path'},
    format = 'json')

AWS Glue is a wrapper around Apache Spark. Normally Spark writes as many files as there are partitions. If it is writing empty files, it means you have empty partitions. The fix is to repartition your dynamic_frame into fewer partitions. With Spark DataFrames you would use the "coalesce" function.

In Glue you could try to use the repartition function: https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-repartition
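For example, a minimal sketch assuming the output is small enough to be written as a single partition (the partition count of 1 and the S3 path are placeholders; adjust both to your data volume):

dynamic_frame = ...  # result of Glue job processing

# Collapse the data into a single partition so Spark does not emit
# empty part files for partitions that hold no records.
repartitioned = dynamic_frame.repartition(1)

glue_context.write_dynamic_frame.from_options(
    frame = repartitioned,
    connection_type = 's3',
    connection_options = {'path': 's3://some-bucket/some-path'},
    format = 'json')

Equivalently, you could convert to a Spark DataFrame, call coalesce (which avoids a full shuffle), and convert back with DynamicFrame.fromDF before writing.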
