How is abnormal Driver termination handled for a Spark App in Yarn cluster mode

Original post 2020-04-08 13:11:16 1 2 apache-spark / amazon-emr

We're using AWS EMR for our Spark jobs. All our jobs are submitted in YARN cluster mode, so the driver runs on one of the cluster nodes. We use an on-demand node for the master and spot instances for the core nodes. Although we almost always choose instances with a < 5% interruption rate, it sometimes happens that a significant fraction of our cluster nodes get terminated prematurely (probably because of higher demand).

So I was wondering: in the above situation, what happens if the node running the driver process goes down? Is there any chance of recovery for the Spark job in that case, or is the job gone forever?

2 answers

The Spark driver is a single point of failure because it holds all cluster state for the running App.
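One thing worth knowing here: in cluster mode the driver runs inside the YARN Application Master, and YARN can start a new application attempt if that container is lost. As far as I know, a new attempt re-runs the application from the beginning rather than recovering the old driver's state. A minimal sketch of a submit command that sets the two Spark-on-YARN retry properties (the app path and the values are placeholders, and the cluster-wide YARN limit on AM attempts still applies):

```python
def build_submit_command(app_path, max_attempts=2, validity_interval="1h"):
    """Return a spark-submit argv for YARN cluster mode with AM retry settings.

    app_path, max_attempts and validity_interval are illustrative values,
    not recommendations.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Upper bound on application attempts; capped by the YARN-side limit.
        "--conf", f"spark.yarn.maxAppAttempts={max_attempts}",
        # Failures older than this interval no longer count against the limit.
        "--conf", f"spark.yarn.am.attemptFailuresValidityInterval={validity_interval}",
        app_path,
    ]
```

The returned list can be passed to `subprocess.run` or joined with `shlex.join` for logging. A retry like this only helps if the job is safe to run again from scratch.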

In practice, non-ephemeral storage can be used for checkpointing batch Apps after expensive transformations. That said, restarting after such a failure can be done, but when I looked into it, it was quite difficult, to say the least. I asked such a question under my name some time ago; you can find it. I am quite technical but felt: gosh, what a lot of hard work.

So recovery means rolling your own stuff or accepting a re-run. Since I last evaluated EMR, I see that the driver can run on the Master and that can be failed over, but as far as I can see that is not the same thing, nor what you want.
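To give a rough idea of what "rolling your own" checkpointing can look like for a batch job, here is a small sketch. It is not a complete recovery scheme. It assumes the output committer writes a `_SUCCESS` marker (the Hadoop default), and `path_exists` is a hypothetical helper you would have to supply for your storage (for example, an S3 existence check).

```python
def load_or_compute(spark, path, compute, path_exists):
    """Return the DataFrame saved at `path`, computing and saving it first if needed.

    spark       -- an active SparkSession
    path        -- durable location, e.g. an S3 prefix (not node-local disk)
    compute     -- zero-argument callable that builds the expensive DataFrame
    path_exists -- hypothetical helper: callable(str) -> bool for your storage
    """
    success_marker = path.rstrip("/") + "/_SUCCESS"
    if not path_exists(success_marker):
        # First run, or an earlier run died before the write was committed.
        compute().write.mode("overwrite").parquet(path)
    # Read the saved data back so later stages do not depend on the old lineage.
    return spark.read.parquet(path)
```

On a re-run after a driver loss, every stage wrapped this way is skipped if its output was already committed, so only the work after the last saved stage is repeated.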

EMR uses node labels for CORE nodes in YARN. Your Spark driver / Application Master only gets created on CORE nodes, and HDFS also resides only on CORE nodes. So, to handle your situation in the best way, you may consider using both CORE and TASK groups. What you can do to tackle this:

MASTER: On-demand
CORE: On-demand. The minimum number of instances can be 1.
TASK: Spot, with autoscaling and a minimal EBS volume. The minimum number of instances can be 0 in this case.

This will reduce your cost and also ensure that the node containing the driver process is not lost to a spot interruption.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html
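The MASTER / CORE / TASK layout above could be written down roughly as follows. This is only a sketch: the dictionaries follow the `InstanceGroups` shape accepted by boto3's EMR `run_job_flow`, the instance types and counts are placeholders, and autoscaling and EBS settings are left out.

```python
def instance_groups(instance_type="m5.xlarge", task_count=2):
    """Build an EMR instance-group layout: on-demand MASTER and CORE, spot TASK.

    instance_type and task_count are placeholder values for illustration.
    """
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # The driver / Application Master and HDFS live here, so keep it on-demand.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Executors only; losing these nodes costs tasks, not the driver.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]
```

The list would go under `Instances={"InstanceGroups": ...}` in the cluster request. The point of the layout is simply that the only spot capacity is in the group that never hosts the driver.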

Related questions

Resources/Documentation on how does the failover process work for the Spark Driver (and its YARN Container) in yarn-cluster mode

submitting PySpark app to spark on YARN in cluster mode

What is the benefit of using more then 1 driver core in spark yarn cluster mode?

How to submit Spark application to YARN in cluster mode?

how to find out driver process node for tasks running in spark in yarn-cluster mode

SparkConf settings not used when running Spark app in cluster mode on YARN

how to : spark yarn cluster

How to run multiple spark jobs parallel on yarn with cluster mode?

How to see spark executing status in yarn-cluster mode on AWS

How YARN knows data locality in Apache spark in cluster mode

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1
1 ACCEPTED 2020-04-08 15:55:26

solution2
0 2020-04-19 13:25:59
