
AWS Glue Spark Job - How to group S3 input files when using CatalogSource?

The AWS Glue Spark API supports grouping multiple smaller input files together ( https://docs.aws.amazon.com/en_en/glue/latest/dg/grouping-input-files.html ), which reduces the number of tasks and partitions.

However, when using a Data Catalog source via getCatalogSource with a table that is in turn backed by files stored on S3, we cannot find any way to pass the above-mentioned grouping parameters to the S3 source.

A bit of background: our ETL job reads many small files, processes the records they contain, and writes them back to S3 while retaining the original folder structure. The output files are supposed to be larger and fewer in number than the source files.

We assume this can be achieved by reading files in groups as described above. Another way to achieve it would basically be to repartition(1), but that would be extremely inefficient.

Are we missing something? Does anyone have an idea how to achieve this efficiently? Ideally we would be able to specify the approximate output file size (which should work when setting 'groupSize': '10000', if we understand the spec correctly).
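For context, here is a minimal sketch of the kind of read we are trying to do. The database and table names are placeholders, and whether additional_options on from_catalog actually forwards the grouping keys to the underlying S3 reader is exactly the open question:

```python
# Minimal sketch (PySpark Glue job). The grouping keys inside
# additional_options are an assumption, not a confirmed behaviour
# for catalog-backed sources.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the catalog table backed by many small S3 files, asking Glue to
# group them into ~1 MB chunks per Spark partition.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",            # hypothetical name
    table_name="my_small_files_table", # hypothetical name
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "1048576",
    },
)

job.commit()
```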

According to AWS Support, all of the properties can be set directly at the Glue table level via the AWS console:

a. Key=groupFiles, value=inPartition
b. Key=groupSize, value=1048576
c. Key=recurse, value=True
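For anyone who prefers doing this programmatically instead of through the console, here is a rough boto3 sketch that sets the same table parameters. The database and table names are placeholders; since update_table replaces the whole table definition, the current one is fetched first:

```python
# Rough sketch: set the grouping properties on the Glue table itself,
# equivalent to adding them in the console. Names are placeholders.
import boto3

glue = boto3.client("glue")

database = "my_database"
table_name = "my_small_files_table"

# update_table needs a full TableInput, so start from the current definition.
table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]

# Keep only the keys that TableInput accepts (drop read-only fields
# such as CreateTime, UpdateTime, CreatedBy, CatalogId, ...).
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "ViewOriginalText", "ViewExpandedText",
    "TableType", "Parameters", "TargetTable",
}
table_input = {k: v for k, v in table.items() if k in allowed}

# Add the grouping properties listed above.
table_input.setdefault("Parameters", {}).update({
    "groupFiles": "inPartition",
    "groupSize": "1048576",   # target group size in bytes (~1 MB)
    "recurse": "True",
})

glue.update_table(DatabaseName=database, TableInput=table_input)
```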
