
Running a Spark job and getting the job ID via a script

Similar to Getting app run id for a Spark job, except from the command line or a script.

I am running spark-submit automatically from our continuous deployment system, and I need to track the application ID so that I can kill it before running the job again (and various other needs).

Specifically, it is a Python script that runs the job on a YARN cluster and can read the standard output of spark-submit, from which we need to extract and save the application ID for later use.

The best plan I have come up with so far is to run spark-submit, watch its standard output, extract the application ID, and then detach from the process. This approach is not ideal in my opinion.
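To make that "watch standard output" plan concrete, here is a minimal sketch, assuming a YARN deployment where the client logging (which may arrive on stderr rather than stdout, depending on the log4j configuration) contains a line with the application ID. The submit_and_get_app_id helper and the regex are illustrative, not an existing API.

import re
import subprocess

APP_ID_RE = re.compile(r"application_\d+_\d+")

def submit_and_get_app_id(spark_submit_args):
    """Launch spark-submit and return the YARN application ID found in its output.

    Note: once this function returns, the child keeps running; if nothing
    continues to drain the merged output, the pipe buffer can eventually fill
    and block the process, so in practice redirect the remaining output to a
    file or keep reading it in a background thread.
    """
    proc = subprocess.Popen(
        ["spark-submit"] + spark_submit_args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; the YARN client often logs the ID on stderr
        universal_newlines=True,
    )
    for line in proc.stdout:
        match = APP_ID_RE.search(line)
        if match:
            return match.group(0)
    raise RuntimeError("spark-submit finished without reporting an application ID")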

Ideally, spark-submit would print only the application ID and then fork into the background, but so far I don't see any way to do this without modifying Spark itself.

Is there a nicer, more obvious way of doing this?

I've created a wrapper script that extracts the application ID for you. It's hosted at: https://github.com/gak/spark-submit-app-id-wrapper

Example:

# pip install spark-submit-app-id-wrapper

# ssaiw spark-submit --master yarn-cluster --class etc etc > /dev/null
application_1448925599375_0050

The CI script can now run spark-submit via ssaiw and grab the application ID as soon as the job is submitted.

Note that it has only been tested with YARN.
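For the "kill it before running the job again" part of the question, here is a hedged sketch of how a CI step could reuse the captured ID between runs. It shells out to the standard yarn application -kill command; the kill_previous_run/remember_app_id helpers and the last_app_id.txt state file are assumptions made for the example, not part of ssaiw.

import os
import subprocess

STATE_FILE = "last_app_id.txt"  # hypothetical location where the previous run stored its ID

def kill_previous_run():
    """Kill the application left over from the previous deployment, if any."""
    if not os.path.exists(STATE_FILE):
        return
    with open(STATE_FILE) as f:
        app_id = f.read().strip()
    if app_id:
        # Tolerate failures (e.g. the application has already finished) so the build does not break.
        subprocess.run(["yarn", "application", "-kill", app_id], check=False)

def remember_app_id(app_id):
    """Record the ID of the job just submitted, for the next deployment to clean up."""
    with open(STATE_FILE, "w") as f:
        f.write(app_id)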

