
Why does writing a DataFrame to S3 (or a dynamic frame to Redshift) give an error after adding a derived column using a UDF in AWS Glue PySpark?

I had an AWS Glue job with an ETL script in PySpark that wrote a dynamic frame to Redshift as a table and to S3 as JSON. One of the columns in this df is status_date. I had no issue writing this df.
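For context, a minimal sketch of the kind of writes described above, assuming a standard Glue job setup where glueContext, args["TempDir"], and the DataFrame df already exist; the connection name, table, and S3 path are placeholders, not the actual values from the job.

```
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame to a Glue DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write to Redshift through a Glue catalog connection (placeholder names)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.status_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)

# Write the same data to S3 as JSON with the plain DataFrame writer
df.write.json("s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/")
```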

I then had a requirement to add two more columns, financial_year and financial_qarter, based on status_date. For this I created a UDF and added the two new columns. Using printSchema() and show() I saw that the columns were created successfully and the values in them were correct.
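A rough sketch of what such a UDF-based derivation might look like, assuming status_date is a DateType column and an April-to-March financial year; the actual business rule used in the original job is not shown in the question.

```
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def to_financial_year(d):
    # Assumed rule: financial year runs April to March
    if d is None:
        return None
    if d.month >= 4:
        return "{}-{}".format(d.year, d.year + 1)
    return "{}-{}".format(d.year - 1, d.year)

def to_financial_quarter(d):
    if d is None:
        return None
    # Q1 = Apr-Jun, Q2 = Jul-Sep, Q3 = Oct-Dec, Q4 = Jan-Mar
    return "Q" + str(((d.month - 4) % 12) // 3 + 1)

fy_udf = udf(to_financial_year, StringType())
fq_udf = udf(to_financial_quarter, StringType())

df = (df.withColumn("financial_year", fy_udf(col("status_date")))
        .withColumn("financial_qarter", fq_udf(col("status_date"))))

df.printSchema()
df.show(5)
```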

The problem came when I tried to write this to AWS S3 and AWS Redshift. It gives a weird error that I am not able to troubleshoot.

Error if I write to Redshift - An error occurred while calling o177.pyWriteDynamicFrame. File already exists:s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/491bb37a-404a-4ec5-a459-5534d94b0206/part-00002-af529a71-7315-4bd1-ace5-91ab5d9f7f46-c000.csv

Error if I write to S3 as JSON - An error occurred while calling o175.json. File already exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/part-00026-fd481713-2ccc-4d23-98b0-e96908cb708c-c000.json

As you can see, both errors are of a similar kind. Below is the error trace. Some help required.

> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3 in stage 503.0 failed 4 times, most recent failure: Lost task
> 3.3 in stage 503.0 (, executor 15):
> org.apache.hadoop.fs.FileAlreadyExistsException: File already
> exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-00003-055f9327-1368-4ba7-9216-6a32afac1843-c000.json
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
>   

Are you trying to write to the same location in your code twice? For example, if you have already written to the location s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055 once in your code and are trying to do it again in the same code, it will not work.
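For illustration, a sketch of the pattern this answer warns about, assuming the timestamped prefix seen in the error paths is generated once per run (the path and run-id format are illustrative):

```
from datetime import datetime

# One unique prefix per run, e.g. 24022020124055
run_id = datetime.now().strftime("%d%m%Y%H%M%S")
output_path = "s3://bucket_name/folder1/folder2/folder3/folder4/" + run_id + "/"

df.write.json(output_path)                      # first write succeeds
# df.write.json(output_path)                    # a second write to the same prefix fails
# df.write.mode("overwrite").json(output_path)  # unless overwrite is explicitly allowed
```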

I had this error and it took me a couple of days to find the cause. My issue was caused by the file format/data types of the S3 files. Create UTF-8 files or change all your data files to UTF-8.
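One possible way to re-encode a source file to UTF-8 before the job reads it (the original encoding "latin-1" and the file names are assumptions):

```
# Re-encode a local copy of the source file to UTF-8
with open("source_file.csv", "r", encoding="latin-1") as src:
    data = src.read()
with open("source_file_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```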

Or find the "bad" record by ordering the records in the source S3 files by ID or some unique identifier and then looking at the last file created, for example s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-xxxx.

Find the last record in this file and match it to the source file. In my case the problem record was 3 to 7 lines past the last record that was imported. If it is a special character then you need to change your file format.

Or, as a quick check, just remove that special character and recheck whether the output file gets past the previous "bad" record.
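As a rough way to locate such a record, a local copy of the suspect source file can be scanned for lines that do not decode as UTF-8 (the file name is a placeholder):

```
# Print the line numbers of any lines that are not valid UTF-8
with open("suspect_source_file.csv", "rb") as f:
    for line_no, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("line {}: {}".format(line_no, exc))
```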
