
Why does writing a DataFrame to S3 (or writing a dynamic frame to Redshift) give an error after adding derived columns using a UDF in AWS Glue PySpark?

I had an AWS Glue job with an ETL script in PySpark which wrote a dynamic frame to Redshift as a table and to S3 as JSON. One of the columns in this df is status_date. I had no issue writing this df.

I then had a requirement to add two more columns, financial_year and financial_quarter, based on the status_date. For this I created a UDF and added the two new columns. Using printSchema() and show() I saw that the columns were successfully created and the values in them were correct.
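For context, a minimal sketch of how such columns might be derived with a UDF is shown below; the April-to-March financial year, the assumption that status_date is a DateType column, and the exact return formats are illustrative, not taken from the original job.

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Assumes status_date is a DateType column, so the Python UDF
    # receives datetime.date objects. An April-to-March financial
    # year is assumed here.
    @udf(returnType=StringType())
    def financial_year(status_date):
        if status_date is None:
            return None
        start = status_date.year if status_date.month >= 4 else status_date.year - 1
        return f"{start}-{start + 1}"

    @udf(returnType=StringType())
    def financial_quarter(status_date):
        if status_date is None:
            return None
        # Q1 = Apr-Jun, Q2 = Jul-Sep, Q3 = Oct-Dec, Q4 = Jan-Mar
        return "Q" + str(((status_date.month - 4) % 12) // 3 + 1)

    df = (df.withColumn("financial_year", financial_year(F.col("status_date")))
            .withColumn("financial_quarter", financial_quarter(F.col("status_date"))))
    df.printSchema()
    df.show(5)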

The problem came when I tried to write this to AWS S3 and AWS Redshift. It gives a weird error which I am not able to troubleshoot.
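For reference, the two writes look roughly like the sketch below; the Glue connection name, table name, database name and S3 paths are placeholders, not the real ones from my script.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame

    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)

    dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

    # Write to Redshift through a Glue catalog connection,
    # staging through a temporary S3 directory
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="my-redshift-connection",                       # placeholder
        connection_options={"dbtable": "schema.table", "database": "my_db"},  # placeholders
        redshift_tmp_dir="s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/",
    )

    # Write the same data to S3 as JSON
    df.write.json("s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/")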

Error if I write to Redshift - An error occurred while calling o177.pyWriteDynamicFrame. File already exists:s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/491bb37a-404a-4ec5-a459-5534d94b0206/part-00002-af529a71-7315-4bd1-ace5-91ab5d9f7f46-c000.csv

Error if I write to S3 as JSON - An error occurred while calling o175.json. File already exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/part-00026-fd481713-2ccc-4d23-98b0-e96908cb708c-c000.json

As you can see, both errors are of a similar kind. Below is the error trace. Some help is required.

> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3 in stage 503.0 failed 4 times, most recent failure: Lost task
> 3.3 in stage 503.0 (, executor 15):
> org.apache.hadoop.fs.FileAlreadyExistsException: File already
> exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-00003-055f9327-1368-4ba7-9216-6a32afac1843-c000.json
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
>   

Are you trying to write to the same location twice in your code? For example, if you have already written to the location s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055 once in your code and try to do it again in the same code, it will not work.
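A minimal illustration of the pattern being described here, with placeholder paths:

    out = "s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/"

    df.write.json(out)        # first write to this prefix succeeds
    # df.write.json(out)      # a second write to the same prefix in the same job fails

    # Either give each output its own prefix...
    df.write.json(out + "snapshot2/")
    # ...or state the intent explicitly with a save mode:
    df.write.mode("overwrite").json(out)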

I had this error and it took me a couple of days to find the cause. My issue was caused by the file format/data types of the S3 files. Create UTF-8 files or convert all your data files to UTF-8.
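One way to do that conversion (a sketch only; the file names and the original latin-1 encoding are assumptions) is to re-encode the source file before the Glue job reads it:

    # Re-encode a source file to UTF-8; file names and the original
    # latin-1 encoding are assumptions for illustration.
    src = "source_extract.csv"
    dst = "source_extract_utf8.csv"

    with open(src, "r", encoding="latin-1") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)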

Alternatively, find the "bad" record by ordering the records in the source S3 files by ID or some unique identifier, then look at the last file created, for example s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-xxxx.

Find the last record in that file and match it back to the source file. In my case the problem record was 3 to 7 lines past the last record that was imported. If it is a special character, then you need to change your file format.

Or, as a quick check, just remove that special character and recheck whether the output gets past the previous "bad" record.
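If you would rather locate the offending line programmatically, a small helper like the following (not part of the original answer; the file name is a placeholder) reports lines that are not valid UTF-8:

    def find_non_utf8_lines(path):
        """Return (line_number, error) pairs for lines that fail UTF-8 decoding."""
        bad = []
        with open(path, "rb") as f:
            for lineno, raw in enumerate(f, start=1):
                try:
                    raw.decode("utf-8")
                except UnicodeDecodeError as err:
                    bad.append((lineno, err))
        return bad

    for lineno, err in find_non_utf8_lines("source_extract.csv"):
        print(f"line {lineno}: {err}")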
