
A Spark job on EMR suddenly taking 30 hours (up from 5 hours)

I'm relatively new to Spark. I have a Spark job that runs on an Amazon EMR cluster with 1 master node and 8 core nodes. In a nutshell, the Spark job reads some .csv files from S3, transforms them into RDDs, performs some relatively complex joins on the RDDs, and finally produces other .csv files on S3. This job, executed on the EMR cluster, used to take about 5 hours. Suddenly, one day, it started to take over 30 hours, and it has done so ever since. There is no apparent difference in the inputs (the S3 files).
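In outline, the job looks roughly like the sketch below (bucket names, paths, and the join key are placeholders, not the real job, and the actual joins are more involved than the single join shown here):

    import org.apache.spark.sql.SparkSession

    object CsvJoinJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-join-job").getOrCreate()

        // Read the input .csv files from S3 (bucket and paths are placeholders)
        val left  = spark.read.option("header", "true").csv("s3://my-bucket/input/left/")
        val right = spark.read.option("header", "true").csv("s3://my-bucket/input/right/")

        // Convert to pair RDDs keyed on the join column
        val leftRdd  = left.rdd.map(r => (r.getString(0), r.mkString(",")))
        val rightRdd = right.rdd.map(r => (r.getString(0), r.mkString(",")))

        // The relatively complex joins, simplified to a single join here
        val joined = leftRdd.join(rightRdd).map { case (k, (l, r)) => s"$k,$l,$r" }

        // Write the result back to S3 as .csv-like text files
        joined.saveAsTextFile("s3://my-bucket/output/joined/")
        spark.stop()
      }
    }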

I've checked the logs, and in the long (30-hour) run I can see OutOfMemory errors:

java.lang.OutOfMemoryError: Java heap space
        at java.util.IdentityHashMap.resize(IdentityHashMap.java:472)
        at java.util.IdentityHashMap.put(IdentityHashMap.java:441)
        at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:174)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:225)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:224)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
....

        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)

Despite the apparent OutOfMemory exception(s), the outputs (the S3 files) look correct, so the Spark job apparently finishes properly.

What could suddenly cause the jump from a 5-hour execution to 30 hours? How would you go about investigating such an issue?

Spark retries on failure, and your processes are failing. When an executor dies, all of its active tasks are probably considered failed and re-queued elsewhere in the cluster, which would explain why the job still finishes correctly but takes much longer.
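If the OutOfMemoryError is what is killing the executors, one thing worth checking is how much memory the executors actually get on this cluster. As a rough illustration only (the values below are placeholders, not a recommendation for your instance types, and on EMR they are usually set through spark-submit or an EMR configuration classification rather than in code):

    import org.apache.spark.sql.SparkSession

    // Illustrative settings only -- tune to the instance types on the cluster.
    val spark = SparkSession.builder()
      .appName("csv-join-job")
      .config("spark.executor.memory", "8g")          // heap per executor
      .config("spark.executor.memoryOverhead", "2g")  // extra off-heap headroom on YARN
      .config("spark.task.maxFailures", "4")          // default: a task is retried up to 4 times before the job fails
      .getOrCreate()

Looking at the Executors tab of the Spark UI (or the YARN container logs on EMR) for dead executors and task retry counts should tell you whether this kind of fail-and-retry cycle is what is stretching the run to 30 hours.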
