
AWS Glue job is failing for large input csv data on s3

For small s3 input files (~10GB), the Glue ETL job works fine, but for a larger dataset (~200GB) the job fails.

Here is part of the ETL code:

# Convert the Glue DynamicFrame to a Spark DataFrame
df = dropnullfields3.toDF()

# create new partition column
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))

# store the data in parquet format on s3 
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")

The job executed for 4 hours and then threw this error:

File "script_2017-11-23-15-07-32.py", line 49, in partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append") File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o172.save. 文件“script_2017-11-23-15-07-32.py”,第49行,分区为_dataframe.write.partitionBy(['part_date'])。format(“parquet”).save(output_lg_partitioned_dir,mode =“append” )文件“/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449442652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py”,第550行,在保存文件中“/ mnt / yarn / usercache / root / appcache / application_1511449472652_0001 /container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py“,第1133行,在调用文件中”/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/ utils.py“,第63行,在deco文件中”/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py“,第319行,在get_return_value py4j中.protocol.Py4JJavaError:调用o172.save时发生错误。 : org.apache.spark.SparkException: Job aborted. 
:org.apache.spark.SparkException:作业已中止。 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.Sp at org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $ write $ 1.apply $ mcV $ sp(FileFormatWriter.scala:147)at org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun在org.apache.spark.sql上的org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $ write $ 1.apply(FileFormatWriter.scala:121)$ $ $ apply(FileFormatWriter.scala:121)。 execution.SQLExecution $ .withNewExecutionId(SQLExecution.scala:57)atg.apache.spache.sspark.sql.execution.datasources.FileFormatWriter $ .write(FileFormatWriter.scala:121)at org.apache.spark.sql.execution.datasources。在org.apache.spache.spark.sql.execution.command.ExecutedCommandExec的org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult $ lzycompute(commands.scala:58)的InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)。 orE.apache.spark.sql.execution.Sp上的org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)中的sideEffectResult(commands.scala:56) arkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccess arkPlan $$ anonfun $执行$ 1.apply(SparkPlan.scala:114)org.apache.spark.sql.execution.SparkPlan $$ anonfun $执行$ 1.apply(SparkPlan.scala:114)org.apache.spark。位于org.apache.spark.sql.execution的org.apache.spark.rdd.RDDOperationScope $ .withScope(RDDOperationScope.scala:151)的sql.execution.SparkPlan $$ anonfun $ executeQuery $ 1.apply(SparkPlan.scala:135) .SparkPlan.executeQuery(SparkPlan.scala:132)org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)at 
org.apache.spark.sql.execution.QueryExecution.toRdd $ lzycompute(QueryExecution) .scala:87)org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)at at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)atg.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at at at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccess orImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3385 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.s orImpl.java:62)在sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)at java.lang.reflect.Method.invoke(Method.java:498)py4j.reflection.MethodInvoker.invoke(MethodInvoker.java) :244)at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)at py4j.Gateway.invoke(Gateway.java:280)py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)at py4j.commands .CallCommand.execute(CallCommand.java:79)at py4j.GatewayConnection.run(GatewayConnection.java:214)at java.lang.Thread.run(Thread.java:748)引起:org.apache.spark.SparkException:由于阶段失败导致作业中止:3385任务的序列化结果总大小(1024.1 MB)大于org.apache.spark.scheduler.DAGScheduler.org上的spark.driver.maxResultSize(1024.0 MB)$ apache $ spark $ scheduler $ DAGScheduler $$ failJobAndIndependentStages(DAGScheduler.scala:1435)org.apache.spark.scheduler.DAGScheduler $$ anonfun $ abortStage $ 1.apply(DAGScheduler.scala:1423)at org.apache.s park.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) park.scheduler.DAGScheduler $$ anonfun $ abortStage $ 1.apply(DAGScheduler.scala:1422)at 
scala.collection.mutable.ResizableArray $ class.foreach(ResizableArray.scala:59)at scala.collection.mutable.ArrayBuffer.foreach( ArrayBuffer.scala:48)org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)at org.apache.spark.scheduler.DAGScheduler $$ anonfun $ handleTaskSetFailed $ 1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler $$ anonfun $ handleTaskSetFailed $ 1.apply(DAGScheduler.scala:802)at scala.Option.foreach(Option.scala:257)at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed (dAGScheduler.scala:802)位于org.apache的org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)。 spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)at org.apache.spark.util.EventLoop $$ anon $ 1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127) ... 30 more org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)org.apache.spark.SparkContext.runJob(SparkContext.scala) :1931)org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)at org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $ write $ 1.apply $ mcV $ sp(FileFormatWriter.scala :127)......还有30多个

End of LogType:stdout

I would appreciate it if you could provide any guidance to resolve this issue.

You can only set configurable options like spark.driver.maxResultSize during context instantiation, and Glue provides you with a context (from memory, you can't instantiate a new one). I don't think you will be able to change the value of this property.
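For comparison, this is how the limit would be raised in a plain (non-Glue) PySpark job, where you control context creation yourself; whether Glue exposes an equivalent knob is exactly the open question here, so treat this as a sketch rather than a confirmed Glue workaround:

# Sketch only: raising the driver result-size limit at context creation time.
# This applies to a standalone PySpark script; Glue hands you its own context.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.driver.maxResultSize", "4g")  # default is 1g
sc = SparkContext(conf=conf)
spark = SparkSession(sc)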

You'll normally get this error if you collect results to the driver which exceed the specified size. You aren't doing that in this case, so the error is confusing.
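For illustration only (this call is not in your script), the typical trigger looks like:

# Hypothetical example: an explicit collect() ships every row to the driver,
# and the serialized result can then exceed spark.driver.maxResultSize.
rows = partitioned_dataframe.collect()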

It seems like you are spawning 3385 tasks, which are presumably related to the dates in your input file (3385 dates, ~9 years?). You might try writing this file in batches, e.g.:

from pyspark.sql.functions import year

partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
for yr in range(2000, 2018):
    # filter one year's slice out of the full frame, then append it to s3
    year_batch = partitioned_dataframe.where(year(partitioned_dataframe['part_date']) == yr)
    (year_batch.write.partitionBy(['part_date'])
        .format("parquet")
        .save(output_lg_partitioned_dir, mode="append"))

I haven't checked this code; note that it relies on pyspark.sql.functions.year (imported above), and the loop variable is named yr so it doesn't shadow that function.

When I've done data processing with Glue, I found that batching the work was more effective than trying to get large datasets to complete successfully in one pass. The system is good but hard to debug; stability on large data doesn't come easily.
