
Running a Spark job and getting the job ID via a script

Similar to Getting app run id for a Spark job, except from the command line or a script.

I am running spark-submit automatically from our continuous deployment system, and I need to track the application ID so that I can kill it before running the job again (and various other needs).

Specifically, it is a Python script that runs the job on a YARN cluster and can read the standard output of spark-submit, from which we need to extract and save the application ID for later use.

The best plan I have come up with so far is to run spark-submit, watch its standard output, extract the application ID, and then detach from the process. This approach is not ideal in my opinion.
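To make that "watch standard output" plan concrete, here is a minimal sketch, assuming a YARN deployment where the client logging (which may arrive on stderr rather than stdout, depending on the log4j configuration) contains a line with the application ID. The submit_and_get_app_id helper and the regex are illustrative, not an existing API.

import re
import subprocess

APP_ID_RE = re.compile(r"application_\d+_\d+")

def submit_and_get_app_id(spark_submit_args):
    """Launch spark-submit and return the YARN application ID found in its output.

    Note: once this function returns, the child keeps running; if nothing
    continues to drain the merged output, the pipe buffer can eventually fill
    and block the process, so in practice redirect the remaining output to a
    file or keep reading it in a background thread.
    """
    proc = subprocess.Popen(
        ["spark-submit"] + spark_submit_args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; the YARN client often logs the ID on stderr
        universal_newlines=True,
    )
    for line in proc.stdout:
        match = APP_ID_RE.search(line)
        if match:
            return match.group(0)
    raise RuntimeError("spark-submit finished without reporting an application ID")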

Ideally, spark-submit would print only the application ID and then fork into the background, but so far I don't see any way to do this without modifying Spark itself.

Is there a nicer, more obvious way of doing this?

I've created a wrapper script that extracts the application ID for you. It's hosted at: https://github.com/gak/spark-submit-app-id-wrapper

Example:

# pip install spark-submit-app-id-wrapper

# ssaiw spark-submit --master yarn-cluster --class etc etc > /dev/null
application_1448925599375_0050

The CI script can now run spark-submit via ssaiw and grab the application ID as soon as the job is submitted.

Note that it has only been tested with YARN.
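For the "kill it before running the job again" part of the question, here is a hedged sketch of how a CI step could reuse the captured ID between runs. It shells out to the standard yarn application -kill command; the kill_previous_run/remember_app_id helpers and the last_app_id.txt state file are assumptions made for the example, not part of ssaiw.

import os
import subprocess

STATE_FILE = "last_app_id.txt"  # hypothetical location where the previous run stored its ID

def kill_previous_run():
    """Kill the application left over from the previous deployment, if any."""
    if not os.path.exists(STATE_FILE):
        return
    with open(STATE_FILE) as f:
        app_id = f.read().strip()
    if app_id:
        # Tolerate failures (e.g. the application has already finished) so the build does not break.
        subprocess.run(["yarn", "application", "-kill", app_id], check=False)

def remember_app_id(app_id):
    """Record the ID of the job just submitted, for the next deployment to clean up."""
    with open(STATE_FILE, "w") as f:
        f.write(app_id)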

