
A Spark job on EMR suddenly taking 30 hours (up from 5 hours)

I'm relatively new to Spark. I have a Spark job that runs on an Amazon EMR cluster with 1 master node and 8 core nodes. In a nutshell, the Spark job reads some .csv files from S3, transforms them into RDDs, performs some relatively complex joins on the RDDs, and finally produces other .csv files on S3. This job, executed on the EMR cluster, used to take about 5 hours. Suddenly, one day, it started to take over 30 hours, and it has done so ever since. There is no apparent difference in the inputs (the S3 files).
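In outline, the job looks roughly like the sketch below (bucket names, paths, and the join key are placeholders, not the real job, and the actual joins are more involved than the single join shown here):

    import org.apache.spark.sql.SparkSession

    object CsvJoinJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-join-job").getOrCreate()

        // Read the input .csv files from S3 (bucket and paths are placeholders)
        val left  = spark.read.option("header", "true").csv("s3://my-bucket/input/left/")
        val right = spark.read.option("header", "true").csv("s3://my-bucket/input/right/")

        // Convert to pair RDDs keyed on the join column
        val leftRdd  = left.rdd.map(r => (r.getString(0), r.mkString(",")))
        val rightRdd = right.rdd.map(r => (r.getString(0), r.mkString(",")))

        // The relatively complex joins, simplified to a single join here
        val joined = leftRdd.join(rightRdd).map { case (k, (l, r)) => s"$k,$l,$r" }

        // Write the result back to S3 as .csv-like text files
        joined.saveAsTextFile("s3://my-bucket/output/joined/")
        spark.stop()
      }
    }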

I've checked the logs, and in the long (30-hour) run I can see OutOfMemory errors:

java.lang.OutOfMemoryError: Java heap space
        at java.util.IdentityHashMap.resize(IdentityHashMap.java:472)
        at java.util.IdentityHashMap.put(IdentityHashMap.java:441)
        at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:174)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:225)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:224)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
....

        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)

Despite the apparent OutOfMemory exception(s), the outputs (the S3 files) look correct, so the Spark job apparently finishes properly.

What could suddenly cause the jump from a 5-hour execution to 30 hours? How would you go about investigating such an issue?

Spark retries on failure, and your processes are failing. When an executor dies, all of its active tasks are probably considered failed and re-queued elsewhere in the cluster, which would explain why the job still finishes correctly but takes much longer.
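If the OutOfMemoryError is what is killing the executors, one thing worth checking is how much memory the executors actually get on this cluster. As a rough illustration only (the values below are placeholders, not a recommendation for your instance types, and on EMR they are usually set through spark-submit or an EMR configuration classification rather than in code):

    import org.apache.spark.sql.SparkSession

    // Illustrative settings only -- tune to the instance types on the cluster.
    val spark = SparkSession.builder()
      .appName("csv-join-job")
      .config("spark.executor.memory", "8g")          // heap per executor
      .config("spark.executor.memoryOverhead", "2g")  // extra off-heap headroom on YARN
      .config("spark.task.maxFailures", "4")          // default: a task is retried up to 4 times before the job fails
      .getOrCreate()

Looking at the Executors tab of the Spark UI (or the YARN container logs on EMR) for dead executors and task retry counts should tell you whether this kind of fail-and-retry cycle is what is stretching the run to 30 hours.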
