
Why does writing a DataFrame to S3 (or a dynamic frame to Redshift) give an error after adding a derived column using a UDF in AWS Glue PySpark?

I had an AWS Glue job with an ETL script in PySpark that wrote a dynamic frame to Redshift as a table and to S3 as JSON. One of the columns in this df is status_date. I had no issue writing this df.
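For context, a minimal sketch of the kind of writes described above, assuming a standard Glue job setup where glueContext, args["TempDir"], and the DataFrame df already exist; the connection name, table, and S3 path are placeholders, not the actual values from the job.

```
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame to a Glue DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write to Redshift through a Glue catalog connection (placeholder names)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.status_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)

# Write the same data to S3 as JSON with the plain DataFrame writer
df.write.json("s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/")
```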

I then had a requirement to add two more columns, financial_year and financial_qarter, based on status_date. For this I created a UDF and added the two new columns. Using printSchema() and show() I saw that the columns were created successfully and the values in them were correct.
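A rough sketch of what such a UDF-based derivation might look like, assuming status_date is a DateType column and an April-to-March financial year; the actual business rule used in the original job is not shown in the question.

```
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def to_financial_year(d):
    # Assumed rule: financial year runs April to March
    if d is None:
        return None
    if d.month >= 4:
        return "{}-{}".format(d.year, d.year + 1)
    return "{}-{}".format(d.year - 1, d.year)

def to_financial_quarter(d):
    if d is None:
        return None
    # Q1 = Apr-Jun, Q2 = Jul-Sep, Q3 = Oct-Dec, Q4 = Jan-Mar
    return "Q" + str(((d.month - 4) % 12) // 3 + 1)

fy_udf = udf(to_financial_year, StringType())
fq_udf = udf(to_financial_quarter, StringType())

df = (df.withColumn("financial_year", fy_udf(col("status_date")))
        .withColumn("financial_qarter", fq_udf(col("status_date"))))

df.printSchema()
df.show(5)
```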

The problem came when I tried to write this to AWS S3 and AWS Redshift. It gives a weird error that I am not able to troubleshoot.

Error if I write to Redshift - An error occurred while calling o177.pyWriteDynamicFrame. File already exists:s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/491bb37a-404a-4ec5-a459-5534d94b0206/part-00002-af529a71-7315-4bd1-ace5-91ab5d9f7f46-c000.csv

Error if I write to S3 as JSON - An error occurred while calling o175.json. File already exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/part-00026-fd481713-2ccc-4d23-98b0-e96908cb708c-c000.json

As you can see, both errors are of a similar kind. Below is the error trace. Some help required.

> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3 in stage 503.0 failed 4 times, most recent failure: Lost task
> 3.3 in stage 503.0 (, executor 15):
> org.apache.hadoop.fs.FileAlreadyExistsException: File already
> exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-00003-055f9327-1368-4ba7-9216-6a32afac1843-c000.json
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
>   

Are you trying to write to the same location in your code twice? For example, if you have already written to the location s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055 once in your code and are trying to do it again in the same code, it will not work.
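For illustration, a sketch of the pattern this answer warns about, assuming the timestamped prefix seen in the error paths is generated once per run (the path and run-id format are illustrative):

```
from datetime import datetime

# One unique prefix per run, e.g. 24022020124055
run_id = datetime.now().strftime("%d%m%Y%H%M%S")
output_path = "s3://bucket_name/folder1/folder2/folder3/folder4/" + run_id + "/"

df.write.json(output_path)                      # first write succeeds
# df.write.json(output_path)                    # a second write to the same prefix fails
# df.write.mode("overwrite").json(output_path)  # unless overwrite is explicitly allowed
```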

I had this error and it took me a couple of days to find the cause. My issue was caused by the file format/data types of the S3 files. Create UTF-8 files or change all your data files to UTF-8.
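One possible way to re-encode a source file to UTF-8 before the job reads it (the original encoding "latin-1" and the file names are assumptions):

```
# Re-encode a local copy of the source file to UTF-8
with open("source_file.csv", "r", encoding="latin-1") as src:
    data = src.read()
with open("source_file_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```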

Or find the "bad" record by ordering the records in the source S3 files by ID or some unique identifier and then looking at the last file created, for example s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-xxxx.

Find the last record in this file and match it to the source file. In my case the problem record was 3 to 7 lines past the last record that was imported. If it is a special character then you need to change your file format.

Or, as a quick check, just remove that special character and recheck whether the output file gets past the previous "bad" record.
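As a rough way to locate such a record, a local copy of the suspect source file can be scanned for lines that do not decode as UTF-8 (the file name is a placeholder):

```
# Print the line numbers of any lines that are not valid UTF-8
with open("suspect_source_file.csv", "rb") as f:
    for line_no, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("line {}: {}".format(line_no, exc))
```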
