'java.lang.OutOfMemoryError: Java heap space' error in a Spark application when reading an Avro file and performing actions

Question

The Avro file is around 44 MB.

Below is the error from the YARN logs:

20/03/30 06:55:04 INFO spark.ExecutorAllocationManager: Existing executor 18 has been removed (new total is 0)
20/03/30 06:55:04 INFO cluster.YarnClusterScheduler: Cancelling stage 5
20/03/30 06:55:04 INFO scheduler.DAGScheduler: ResultStage 5 (head at IrdsFIInstrumentEnricher.scala:15) failed in 213.391 s due to Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 134, fratlhadooappd30.de.db.com, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container marked as failed: container_1585337469684_0037_02_000029 on host: fratlhadooappd30.de.db.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

Driver stacktrace:
20/03/30 06:55:04 INFO scheduler.DAGScheduler: Job 3 failed: head at IrdsFIInstrumentEnricher.scala:15, took 213.427308 s
20/03/30 06:55:04 ERROR CCOIrdsEnrichmentService: Unexpected error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 134, fratlhadooappd30.de.db.com, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container marked as failed: container_1585337469684_0037_02_000029 on host: fratlhadooappd30.de.db.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    ...
20/03/30 06:48:19 INFO storage.DiskBlockManager: Shutdown hook called
20/03/30 06:48:19 INFO util.ShutdownHookManager: Shutdown hook called

LogType:stdout
Log Upload Time:Mon Mar 30 06:55:10 +0200 2020
LogLength:124
Log Contents:

 java.lang.OutOfMemoryError: Java heap space
 -XX:OnOutOfMemoryError="kill %p"
   Executing /bin/sh -c "kill 62191"...

LogType:container-localizer-syslog
Log Upload Time:Mon Mar 30 06:55:10 +0200 2020
LogLength:0
Log Contents:

Below is the code I am using:

val fiDF = spark.read
  .format("com.databricks.spark.avro")
  .load("C:\\Users\\kativikb\\Downloads\\Temp\\cco-irds\\rds_db_global_rds_fi-instrument_20200328000000_v1_block3_snapshot-inc.avro")
  .limit(1)

val tempDF = fiDF.select("payload.identifier.id")
tempDF.show(10) // ******* Error at this line *******

Answer 1

This happened because the Avro schema was very large and I was using Spark 2.1.0, which perhaps has a bug with larger schemas. This has been fixed in 2.4.0.

I solved this error by replacing the full schema with my own custom schema that contains only the fields I need.
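
The answer does not show the custom schema or say how it was passed to the reader. The following is only a minimal sketch of one way this might look, using the standard DataFrameReader.schema method. The object name, the input path argument, and the assumption that id is a nullable string are all hypothetical; the nested path payload.identifier.id is taken from the select in the question. Whether the Avro source on Spark 2.1.0 honors the schema exactly this way has not been verified here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical object name, used only for this illustration.
object ReadAvroWithPrunedSchema {

  // Declare only payload -> identifier -> id instead of the full Avro schema.
  // ASSUMPTION: id is a nullable string; the real type is not given in the post.
  val idOnlySchema: StructType = StructType(Seq(
    StructField("payload", StructType(Seq(
      StructField("identifier", StructType(Seq(
        StructField("id", StringType, nullable = true)
      )), nullable = true)
    )), nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    // Hypothetical: the Avro file path is passed as the first argument.
    val inputPath = args.headOption.getOrElse(sys.error("usage: <avro path>"))

    val spark = SparkSession.builder()
      .appName("read-avro-pruned-schema")
      .getOrCreate()

    // Supply the reduced schema so Spark does not derive the full one from the file.
    val fiDF = spark.read
      .format("com.databricks.spark.avro")
      .schema(idOnlySchema)
      .load(inputPath)

    fiDF.select("payload.identifier.id").show(10)

    spark.stop()
  }
}
```

Any field that is needed later has to be added to the reduced schema, because columns that are not declared there cannot be selected from the resulting DataFrame.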

Answer 1 posted: 2020-04-03 14:10:58
