
AWS Glue Spark Job - How to group S3 input files when using CatalogSource?

The AWS Glue Spark API supports grouping multiple smaller input files together ( https://docs.aws.amazon.com/en_en/glue/latest/dg/grouping-input-files.html ), which reduces the number of tasks and partitions.
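For a plain S3 source (not going through the Data Catalog), those grouping options are passed as connection options. A minimal sketch using the Python (PySpark) Glue API; the bucket, prefix, and format are placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read many small files as grouped input; "groupSize" is the target group
# size in bytes, "inPartition" groups files within each Spark partition.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/input/"],   # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "1048576",               # ~1 MB per group
    },
    format="json",
)
```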

However, when using a Data Catalog source through getCatalogSource with a table that is in turn backed by files stored on S3, we cannot find any way to pass the above-mentioned grouping parameters to the S3 source.
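For reference, a catalog-backed read (the Python equivalent of getCatalogSource) looks roughly like this; database and table names are placeholders, and there is no obvious place in this call to supply groupFiles/groupSize:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Catalog-backed read: the S3 location, format, etc. come from the
# Data Catalog table definition, not from this call.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder
    table_name="my_table",         # placeholder
    transformation_ctx="source0",
)
```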

A bit of background: our ETL job reads many small files, processes the contained records, and writes them back to S3 while retaining the original folder structure. The output files are supposed to be larger and fewer in number than the source files.

We assume this can be achieved by reading files in groups as described above. Another way to achieve it would be to repartition(1) before writing, but that would be extremely inefficient.
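For comparison, the repartition workaround mentioned above would look roughly like this in the Python API; it funnels all records through a single partition and writer task, which is why it does not scale:

```python
from awsglue.dynamicframe import DynamicFrame

# Collapse everything into one partition before writing: yields one large
# output file, but all data flows through a single task.
df_single = dyf.toDF().repartition(1)
dyf_single = DynamicFrame.fromDF(df_single, glue_context, "single_partition")
```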

Are we missing something? Does somebody have an idea how to achieve this efficiently? Ideally we would be able to specify the approximate output file size (which should work when setting 'groupSize': '10000', if we understand the spec correctly).

According to AWS Support, all of the properties can be set directly at the Glue table level via the AWS console:

a. Key=groupFiles, value=inPartition
b. Key=groupSize, value=1048576
c. Key=recurse, value=True
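The same table properties can also be set without the console, for example via boto3. A sketch under the assumption that the existing table definition should be preserved and only the parameters extended; database and table names are placeholders:

```python
import boto3

glue = boto3.client("glue")

database = "my_database"   # placeholder
table = "my_table"         # placeholder

# Fetch the current table definition so nothing else is lost on update.
current = glue.get_table(DatabaseName=database, Name=table)["Table"]

# update_table only accepts TableInput fields, so copy the relevant ones.
table_input_keys = [
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
]
table_input = {k: current[k] for k in table_input_keys if k in current}

# Add the grouping properties on top of any existing parameters.
table_input.setdefault("Parameters", {}).update({
    "groupFiles": "inPartition",
    "groupSize": "1048576",
    "recurse": "True",
})

glue.update_table(DatabaseName=database, TableInput=table_input)
```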

