
How to get PBS/Torque/Slurm job completion status in a Python script

I am writing a Python script that prepares input files for hundreds of jobs (say job.id = 1 to job.id = 1000, where job.id is a self-assigned id) and then submits them to a cluster for execution. Each job has three stages, s1, s2 and s3, where s2 depends on the results of s1 and s3 depends on the results of s2. Each job may take 3 to 4 days using 48-64 CPU cores on the super-cluster. I want my script to handle all the stages of every job automatically. One approach I thought of is to submit the s1 stage of all jobs at once and then, every 5, 10 or 12 hours, either check the output files (if they exist) of all jobs, or read the queue status and see whether a particular job has disappeared from the queue (i.e. has completed). A basic layout of my script follows.

import sched
import time
from subprocess import Popen

jobs_running = True
s = sched.scheduler(time.time, time.sleep)

def Prepare():
    print("prepare jobs by reading some source files")
    print("set some flags for each job, e.g. job.id, job.stage, etc.")
    print("submit jobs using < Popen('qsub nNodes Ncores jobinputfile') >")

def JobStatus():
    global jobs_running
    print("check status of each job")
    """
    for job in jobs:
        if job.stage1 == complete:
            print("go to stage 2")
            print("reset job.stage flags etc.")
        elif job.stage2 == complete:
            print("go to stage 3")
            ...
        else:  # last stage
            ...

    if all stages complete for all jobs:
        set (global var) jobs_running = False
    """

def SecondStage():
    print("prepare for second stage")
    print("submit using < Popen('qsub nNodes Ncores jobinputfile') >")

def TimeScheduler(sc):
    global jobs_running
    JobStatus()
    if jobs_running:
        s.enter(36000, 1, TimeScheduler, (sc,))

if __name__ == "__main__":
    Prepare()
    s.enter(36000, 1, TimeScheduler, (s,))
    s.run()

This is definitely not an elegant solution, for several reasons. For example, I have to check the status of every job in each cycle. Also, if a job completes right after a status check, it waits another 5, 10 or 12 hours before being submitted for the next stage. So my question is:

Is there some way to get a job-completion signal directly from PBS/Slurm, or from the system, in the above layout for, say, job.id = 99, so that that job can go on to its next stage without checking the status of the rest of the jobs? Or can someone suggest a better solution?
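For concreteness, the per-job queue check I have in mind is something like the following minimal sketch (assuming Slurm's squeue is available; the helper names are my own, and qstat could be substituted on PBS/Torque):

```python
import subprocess

def queue_lines(text):
    """Non-empty lines of squeue output; empty once the job has left the queue."""
    return [ln for ln in text.splitlines() if ln.strip()]

def job_in_queue(jobid):
    # `squeue -h -j <id>` prints one line while the job is pending/running
    # and nothing once it has finished and aged out of the queue.
    out = subprocess.run(["squeue", "-h", "-j", str(jobid)],
                         capture_output=True, text=True)
    return bool(queue_lines(out.stdout))
```

But this still has to be called in a polling loop, which is exactly what I would like to avoid.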

The normal way to accomplish this is through job dependencies. For example, if you have a job that depends on another job before it can start, you can do something like this:

jobid1=`qsub phase_one.sh`
jobid2=`qsub phase_two.sh -W depend=afterok:${jobid1}`
# and so on as needed

The link there goes to the Torque documentation. I'm fairly certain that almost any resource manager offers similar functionality.
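For Slurm specifically, the same chaining can be done with `sbatch --dependency=afterok:<jobid>`. A minimal sketch from Python (the helper names are mine; `--parsable` makes sbatch print only the job id):

```python
import subprocess

def build_sbatch_cmd(script, dep_jobid=None):
    """Build an sbatch command line, optionally depending on another job."""
    cmd = ["sbatch", "--parsable"]
    if dep_jobid is not None:
        # afterok: start only if the named job finished with exit code 0
        cmd.append(f"--dependency=afterok:{dep_jobid}")
    cmd.append(script)
    return cmd

def submit(script, dep_jobid=None):
    # With --parsable, sbatch prints just the new job id on stdout.
    out = subprocess.run(build_sbatch_cmd(script, dep_jobid),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

# Chain all three stages of one job up front, no polling needed:
# jid1 = submit("s1.sh")
# jid2 = submit("s2.sh", jid1)
# jid3 = submit("s3.sh", jid2)
```

This way the script can submit all three stages of every job in a single pass and let the scheduler itself release each stage when its predecessor completes, eliminating the periodic status checks entirely.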
