
Glue Job fails to write file

I am backfilling some data via Glue jobs. The job itself reads a TSV from S3, transforms the data slightly, and writes it to S3 as Parquet. Since I already have the data, I am trying to launch multiple jobs at once to reduce the time needed to process it all. When I launch multiple jobs at the same time, I sometimes run into an issue where one of the jobs fails to output the resultant Parquet files to S3. The job itself completes successfully without throwing an error. When I rerun the job as a non-parallel task, the files are output correctly. Is there some issue, either with Glue (or the underlying Spark) or S3, that would cause this?

The same Glue job running in parallel may produce files with the same names, so some of them can be overwritten. If I remember correctly, the transformation context is used as part of the file name. I assume you don't have job bookmarks enabled, so it should be safe to generate the transformation-context value dynamically to ensure it's unique for each job run.
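
A minimal sketch of that approach, assuming a PySpark Glue script with bookmarks disabled; the bucket paths and context names here are hypothetical placeholders:

```python
import sys
import uuid

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# JOB_NAME is passed to the script by Glue; the paths and context
# names below are illustrative, not from the original question.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Per-run suffix so parallel runs of the same script never share a
# transformation context, and hence never collide on output names.
run_suffix = uuid.uuid4().hex

source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t"},  # TSV input
    transformation_ctx="read_tsv_" + run_suffix,
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx="write_parquet_" + run_suffix,
)
```

Note that dynamic transformation-context values would break job bookmarking if you later enable it, since bookmarks track state per context name; in that case, pass a unique value in as a job parameter instead.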
