
AWS Glue export to parquet issue using glueContext.write_dynamic_frame.from_options

I have the following problem.

The code below is auto-generated by AWS Glue.

Its mission is to take data from Athena (backed by .csv files on S3) and transform it into Parquet.

The code works for the reference flight dataset and for some relatively big tables (~100 GB).

However, in most cases it returns an error which does not tell me much.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

conf = (SparkConf()
    .set("spark.driver.maxResultSize", "8g"))

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "XXX", table_name = "csv_impressions", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("event time", "long", "event_time", "long"), ("user id", "string", "user_id", "string"), ("advertiser id", "long", "advertiser_id", "long"), ("campaign id", "long", "campaign_id", "long")], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

The error message identified by AWS Glue is:

An error occurred while calling o72.pyWriteDynamicFrame

The log file also contains:

Job aborted due to stage failure: ... Task failed while writing rows

Any idea how to find out the reason for failure?

Or what it could be?

Part 1: identifying the problem

The way to find out what was causing the problem was to switch the output from .parquet to .csv and drop the ResolveChoice and DropNullFields steps (which Glue suggests automatically for .parquet):

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "csv", transformation_ctx = "datasink2")
job.commit()

It produced a more detailed error message:

An error occurred while calling o120.pyWriteDynamicFrame. Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 182, ip-172-31-78-99.ec2.internal, executor 15): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx1.csv.gz

The file xxxx1.csv.gz mentioned in the error message appears to be too big for Glue (~100 MB as .gzip and ~350 MB as uncompressed .csv).

Part 2: true source of the problem and fix

As mentioned in Part 1, exporting to .csv made it possible to identify the offending file.

Further investigation by loading the .csv into R revealed that one of the columns contained a single string record, while all other values in that column were long or NULL.

After dropping this value in R and re-uploading the data to S3, the problem vanished.
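
If round-tripping through R is not convenient, the same check can be done in Spark. A minimal sketch, assuming the problem column is "advertiser id" and using placeholder S3 paths (neither name is taken from my actual job):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw CSVs with every column as a plain string, so nothing is coerced yet.
raw = spark.read.option("header", "true").csv("s3://your-bucket/impressions/")

# Rows where a supposedly numeric column holds text that cannot be cast to long
# (and is not NULL) are the ones that break the conversion.
is_bad = F.col("advertiser id").isNotNull() & F.col("advertiser id").cast("long").isNull()
raw.filter(is_bad).show(truncate=False)

# Keep only the parseable rows and write them back before re-running the Glue job.
raw.filter(~is_bad).write.option("header", "true").csv("s3://your-bucket/impressions_clean/")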

Note #1: the column was declared string in Athena, so I consider this behaviour a bug.
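
If dropping the record is not an option, an explicit cast spec on ResolveChoice (instead of make_struct) is the usual way to force such a column to a single type; whether it would have avoided the parse failure in my case is untested. A sketch, assuming the affected column ends up as advertiser_id after the mapping:

# Sketch only: force the ambiguous column to long instead of make_struct;
# values that cannot be cast are expected to end up as NULL.
resolvechoice2 = ResolveChoice.apply(
    frame = applymapping1,
    specs = [("advertiser_id", "cast:long")],
    transformation_ctx = "resolvechoice2")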

Note #2: the nature of the problem was not the size of the data. I have successfully processed files up to 200 MB .csv.gz, which corresponds to roughly 600 MB of .csv.

Please use the updated table schema from the data catalog.

I have gone through this same error. In my case, the crawler had created another table for the same file in the database, and I was referencing the old one. This can happen when the crawler crawls the same path again and again and creates tables with different schemas in the data catalog. The Glue job then cannot find the expected table name and schema, which produces this error.
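
A quick way to spot this is to list what the crawler has put in the Data Catalog and check for duplicate tables pointing at the same S3 location. A sketch with boto3, using the database name from the question as a placeholder:

import boto3

# List every table in the database together with its S3 location, so duplicate
# tables created by repeated crawls of the same path stand out.
glue = boto3.client("glue")
for page in glue.get_paginator("get_tables").paginate(DatabaseName="XXX"):
    for table in page["TableList"]:
        print(table["Name"], table["StorageDescriptor"].get("Location", ""))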

Moreover, you can change DeleteBehavior: "LOG" to DeleteBehavior: "DELETE_FROM_DATABASE" so that stale tables are removed from the catalog instead of only being logged.
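
For reference, the same change can be made outside the console with boto3; the crawler name below is a placeholder:

import boto3

# Sketch: update the crawler's schema change policy so that tables whose source
# objects have disappeared are deleted from the Data Catalog rather than only logged.
glue = boto3.client("glue")
glue.update_crawler(
    Name = "csv_impressions_crawler",  # placeholder crawler name
    SchemaChangePolicy = {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DELETE_FROM_DATABASE",
    },
)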
