
AWS Glue Spark Job - How to group S3 input files when using CatalogSource?

The AWS Glue Spark API supports grouping multiple smaller input files together ( https://docs.aws.amazon.com/en_en/glue/latest/dg/grouping-input-files.html ), which reduces the number of tasks and partitions.

However, when using a Data Catalog source via getCatalogSource with a table that is in turn backed by files stored on S3, we cannot find any way to pass the above-mentioned grouping parameters to the S3 source.

A bit of background: our ETL job reads many small files, processes the records they contain, and writes them back to S3 while retaining the original folder structure. The output files are supposed to be larger and fewer in number than the source files.

We assume this can be achieved by reading files in groups as described above. Another way to achieve it would basically be to repartition(1), but that would be extremely inefficient.

Are we missing something? Does anyone have an idea how to achieve this efficiently? Ideally we would be able to specify the approximate output file size (which should work when setting 'groupSize': '10000', if we understand the spec correctly).
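For context, here is a minimal sketch of the kind of read we are trying to do. The database and table names are placeholders, and whether additional_options on from_catalog actually forwards the grouping keys to the underlying S3 reader is exactly the open question:

```python
# Minimal sketch (PySpark Glue job). The grouping keys inside
# additional_options are an assumption, not a confirmed behaviour
# for catalog-backed sources.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the catalog table backed by many small S3 files, asking Glue to
# group them into ~1 MB chunks per Spark partition.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",            # hypothetical name
    table_name="my_small_files_table", # hypothetical name
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "1048576",
    },
)

job.commit()
```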

According to AWS Support, all of the properties can be set directly at the Glue table level via the AWS console:

a. Key=groupFiles, value=inPartition
b. Key=groupSize, value=1048576
c. Key=recurse, value=True
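For anyone who prefers doing this programmatically instead of through the console, here is a rough boto3 sketch that sets the same table parameters. The database and table names are placeholders; since update_table replaces the whole table definition, the current one is fetched first:

```python
# Rough sketch: set the grouping properties on the Glue table itself,
# equivalent to adding them in the console. Names are placeholders.
import boto3

glue = boto3.client("glue")

database = "my_database"
table_name = "my_small_files_table"

# update_table needs a full TableInput, so start from the current definition.
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

# Keep only the keys that TableInput accepts (drop read-only fields
# such as CreateTime, UpdateTime, CreatedBy, CatalogId, ...).
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "ViewOriginalText", "ViewExpandedText",
    "TableType", "Parameters", "TargetTable",
}
table_input = {k: v for k, v in table.items() if k in allowed}

# Add the grouping properties listed above.
table_input.setdefault("Parameters", {}).update({
    "groupFiles": "inPartition",
    "groupSize": "1048576",   # target group size in bytes (~1 MB)
    "recurse": "True",
})

glue.update_table(DatabaseName=database, TableInput=table_input)
```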
