
A Spark job on EMR suddenly taking 30 hours (up from 5 hours)

I'm relatively new to Spark. I have a Spark job that runs on an Amazon EMR cluster with 1 master and 8 core nodes. In a nutshell, the job reads some .csv files from S3, transforms them into RDDs, performs some relatively complex joins on the RDDs, and finally writes other .csv files back to S3. Executed on the EMR cluster, this job used to take about 5 hours. Then one day it started taking over 30 hours, and it has done so ever since. There is no apparent difference in the inputs (the S3 files).

I've checked the logs, and in the long (30-hour) runs I can see OutOfMemory errors:

java.lang.OutOfMemoryError: Java heap space
        at java.util.IdentityHashMap.resize(IdentityHashMap.java:472)
        at java.util.IdentityHashMap.put(IdentityHashMap.java:441)
        at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:174)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:225)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:224)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
....

        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)

In spite of the apparent OutOfMemory exception(s), the outputs (the S3 files) look fine, so the Spark job apparently finishes properly.

What could suddenly cause the jump from 5 hours of execution to 30? How would you go about investigating such an issue?
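One concrete way to start investigating is to look at Spark's event log (written as JSON lines when `spark.eventLog.enabled` is set) and tally task end reasons per stage: a large count of failures would confirm that tasks are dying and being retried. A minimal sketch — the sample lines below are synthetic stand-ins for a real event log file:

```python
import json
from collections import Counter

def count_task_end_reasons(event_log_lines):
    """Tally task end reasons per stage from a Spark event log (JSON lines).

    SparkListenerTaskEnd events carry a "Task End Reason" such as
    "Success", "ExceptionFailure", or "ExecutorLostFailure"; many
    non-Success entries mean tasks are failing and being re-run.
    """
    reasons = Counter()
    for line in event_log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON noise in the log
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        stage = event.get("Stage ID")
        reason = event.get("Task End Reason", {}).get("Reason", "Unknown")
        reasons[(stage, reason)] += 1
    return reasons

# Synthetic sample lines standing in for a real event log file:
sample = [
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 3, '
    '"Task End Reason": {"Reason": "Success"}}',
    '{"Event": "SparkListenerTaskEnd", "Stage ID": 3, '
    '"Task End Reason": {"Reason": "ExceptionFailure"}}',
]
print(count_task_end_reasons(sample))
```

In practice you would feed it the application's event log file (on EMR, under the location configured by `spark.eventLog.dir`), or simply open the Spark history server UI, which shows the same failed/retried task counts per stage.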

Spark retries on failure, and your processes are failing. When that happens, all active tasks on the lost executor are probably considered failed and re-queued elsewhere in the cluster.
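If the retried failures are indeed OOMs, a first step is to give executors more memory headroom and to lower the retry budget so failures surface quickly instead of silently inflating the runtime. A hedged sketch of a spark-submit invocation — the class name, jar, and sizes are placeholders, and the exact overhead setting name varies by Spark version (`spark.yarn.executor.memoryOverhead` on older releases):

```shell
# Placeholder class, jar, and sizes; tune to the actual cluster.
spark-submit \
  --class com.example.MyJob \
  --conf spark.executor.memory=8g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.task.maxFailures=2 \
  --conf spark.eventLog.enabled=true \
  my-job.jar
```

The default `spark.task.maxFailures` is 4, so a job whose tasks fail on every first attempt can do several times the nominal work before anything is reported — consistent with a 5-hour job ballooning to 30.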
