Slurm action at job termination or failure

I would like the Slurm workload manager to perform an action, such as touch stopped.txt, when a job terminates, either because it timed out or because it failed. How can this be done?

Once the job has terminated, regular users have no way to perform further actions from it. (Administrators can use strigger or set up epilog scripts.)

For termination due to a time limit, the typical approach is to set up a Bash trap that catches a signal, and to ask Slurm (with the --signal option of sbatch) to send that signal a few minutes before the job is killed.
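
As a rough equivalent of that Bash trap, here is a minimal sketch written in Python instead (sbatch reads #SBATCH lines from the comment header of any interpreted script). It assumes the job is submitted with --signal=B:USR1@300, which asks Slurm to send SIGUSR1 to the batch process itself about 300 seconds before the time limit. The 300-second lead time, the job.py file name and the program passed on the command line are placeholders.

```python
#!/usr/bin/env python3
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@300
# Sketch: Python stand-in for a Bash "trap 'touch stopped.txt' USR1".
# B:USR1@300 asks Slurm to signal only this batch process, roughly
# 300 seconds (placeholder value) before the time limit.
import signal
import subprocess
import sys
from pathlib import Path

MARKER = Path("stopped.txt")


def on_timeout_warning(signum, frame):
    """Signal handler: record that the job is about to hit its time limit."""
    MARKER.touch()


def main(cmd):
    # Install the handler before starting the real work.
    signal.signal(signal.SIGUSR1, on_timeout_warning)
    # While waiting for the child, Python runs the handler when SIGUSR1
    # arrives and then resumes waiting (PEP 475), so the marker is written
    # while the program keeps running until Slurm enforces the limit.
    return subprocess.run(cmd).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: sbatch job.py ./my_program --input data.txt
    sys.exit(main(sys.argv[1:]))
```

With the B: prefix, Slurm signals only the batch process and none of the other processes, so the launched program is not disturbed by the warning.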

For termination due to failure, you can check the exit code of your main program inside the submission script and act accordingly.
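
A minimal sketch of that check, again in Python rather than Bash and with a placeholder program name: run the main program, look at its exit code, and create stopped.txt if it is non-zero. It only covers failures the batch script itself survives; if the whole job is killed (time limit, node failure), this code never gets to run.

```python
#!/usr/bin/env python3
# Sketch: create stopped.txt when the main program exits with a non-zero code.
import subprocess
import sys
from pathlib import Path

MARKER = Path("stopped.txt")


def run_and_mark_failure(cmd):
    """Run cmd; on a non-zero exit code (or if it cannot start), touch MARKER."""
    try:
        returncode = subprocess.run(cmd).returncode
    except OSError:
        # Program missing or not executable: treat like a shell's 127.
        returncode = 127
    if returncode != 0:
        MARKER.touch()
    return returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: sbatch job.py ./my_program --input data.txt
    sys.exit(run_and_mark_failure(sys.argv[1:]))
```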

Another option, arguably overkill but easier to implement, is to submit a "monitoring" job that depends on the job to watch (for example with --dependency=afterany:<jobid>), and have that monitoring job create stopped.txt based on the watched job's state in the accounting database.
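
The monitoring-job approach could look roughly like this, again as a Python sketch with placeholder names (monitor.py, job.sh). With --submit it submits the job script using sbatch --parsable to capture the job ID, then submits itself with --dependency=afterany:<jobid> so it runs however that job ended. When it runs, it asks sacct for the state of the job allocation and creates stopped.txt unless the state is COMPLETED. It assumes accounting is enabled on the cluster and already holds the final state when the monitoring job starts.

```python
#!/usr/bin/env python3
#SBATCH --job-name=monitor
#SBATCH --time=00:05:00
# Sketch of a "monitoring" job (Python instead of Bash; names are placeholders).
#   python3 monitor.py --submit job.sh   -> submits job.sh, then this file as a
#                                           dependent job with afterany:<jobid>
#   monitor.py <jobid>                   -> run by Slurm once <jobid> has ended
# Assumes Slurm accounting is enabled so that sacct knows the final job state.
import subprocess
import sys
from pathlib import Path

MARKER = Path("stopped.txt")


def needs_marker(state):
    """True unless the job allocation finished as COMPLETED.

    FAILED, TIMEOUT, OUT_OF_MEMORY, NODE_FAIL, "CANCELLED by <uid>" and an
    empty answer from sacct all count as "did not finish normally".
    """
    return state.split()[:1] != ["COMPLETED"]


def job_state(jobid):
    """Ask the accounting database (via sacct) for the job allocation's state."""
    out = subprocess.run(
        ["sacct", "-j", str(jobid), "-X", "--noheader", "--parsable2", "--format=State"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = out.splitlines()
    return lines[0] if lines else ""


def submit_with_monitor(job_script):
    """Submit job_script, then this file to run after it ends in any state."""
    out = subprocess.run(
        ["sbatch", "--parsable", job_script],
        capture_output=True, text=True, check=True,
    ).stdout
    jobid = out.strip().split(";")[0]  # --parsable prints "jobid" or "jobid;cluster"
    subprocess.run(
        ["sbatch", f"--dependency=afterany:{jobid}", __file__, jobid], check=True
    )
    return jobid


def main(argv):
    if len(argv) == 3 and argv[1] == "--submit":
        print(submit_with_monitor(argv[2]))
        return 0
    if len(argv) == 2:
        if needs_marker(job_state(argv[1])):
            MARKER.touch()
        return 0
    print(f"usage: {argv[0]} --submit JOB_SCRIPT | {argv[0]} JOBID", file=sys.stderr)
    return 2


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```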

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM