
LSF - automatic job rerun using sasbatch script

I am trying to create an auto-rerun mechanism by adding some code to the sasbatch script that runs after the SAS command finishes. The general idea is to:

  1. locate the log of the SAS process and the id of the flow containing the current job,

  2. check whether the log contains particular ORA-xxxxx errors for which the known fix is simply to rerun the process,

  3. if so, trigger the jrerun command from the LSF Platform command line interface,

  4. exit sasbatch, passing $rc to LSF.

The idea was implemented as:

#define used paths
log_dir=/path/to/sas_logs_directory
out_log=/path/to/auto-rerun_log.txt
out_log2=/path/to/lsf_rerun_log.txt

if [ -n "${LSB_JOBNAME}" ]; then
    if [ ! -f "$out_log" ]; then
        touch "$out_log"
    fi
    #get flow runtime attributes (LSB_JOBNAME is colon-separated)
    IFS=: read -r flow_id username flow_name job_name <<< "${LSB_JOBNAME}"

    #find the most recent log of the current process
    log_path=$(ls -t "$log_dir"/*.log | xargs grep -li "job:\s*$job_name" | grep -i "/${flow_name}_" | head -1)

    #set path to txt file containing lines which represent the ORA errors we look for
    conf_path=/path/to/error_list

    #check the process' log against each error pattern, line by line
    while read -r line;
    do
        #if error is found in log then try to rerun flow
        if grep -q "$line" "$log_path"; then
            (nohup /path/to/rerun_script.sh "$flow_id" >"$out_log2" 2>&1) &
            disown
            break
        fi
    done < "$conf_path"
fi

Here rerun_script.sh is the script that calls jrerun after a sleep command, so that the parent script can exit with $rc in the meantime. It looks like:

sleep 10
/some/lsf/path/jrerun

The problem is that the job keeps running indefinitely. In the LSF history I can see that jrerun was called before the job exited. Furthermore, in $out_log2 I can see the message: <flow_id> has no starting or exit points.

Does anyone have an idea how I can pass the return code to LSF before jrerun is called? Or is there maybe a simpler way to perform an automatic rerun of SAS jobs in Platform LSF?

I am using SAS 9.4 and Platform Process Manager 9.1.

"Or maybe some simpler way to perform autorerun of SAS jobs in Platform LSF?"

I'm not knowledgeable about the SAS part, but on the LSF side there are at least a couple of ways to requeue the job.

If you have control of the job script, you can use a special process exit value to automatically requeue the job.

https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_admin/job_requeue_about.html
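As a rough illustration of the exit-value approach, the wrapper could translate a "retryable" failure into a dedicated exit code. This is only a sketch: the function name choose_exit_code, the variable REQUEUE_CODE and the value 99 are hypothetical, and the chosen code must also be configured on the LSF side (for example through REQUEUE_EXIT_VALUES for the queue, or bsub -Q at submission) before LSF will requeue on it. How a requeued job behaves inside a Process Manager flow is something to verify in your own environment.

```shell
#!/bin/bash
# Sketch: map "retryable" failures onto an exit code that LSF requeues.
# REQUEUE_CODE is a hypothetical value; it must match what is configured in
# LSF (e.g. REQUEUE_EXIT_VALUES in lsb.queues, or bsub -Q at submission).
REQUEUE_CODE=${REQUEUE_CODE:-99}

# choose_exit_code RC LOG_FILE PATTERN_FILE
# Prints REQUEUE_CODE if RC is non-zero and LOG_FILE contains any line of
# PATTERN_FILE (patterns are treated as fixed strings); otherwise prints RC.
choose_exit_code() {
    local rc=$1 log_file=$2 pattern_file=$3
    if [ "$rc" -ne 0 ] && [ -f "$log_file" ] && [ -s "$pattern_file" ] \
        && grep -qF -f "$pattern_file" -- "$log_file"; then
        echo "$REQUEUE_CODE"
    else
        echo "$rc"
    fi
}

# Possible use at the end of the wrapper, once the SAS command has set rc:
#   exit "$(choose_exit_code "$rc" "$log_path" "$conf_path")"
```

Because the decision is made by the job's own exit code, nothing has to run in the background after the script exits. An error that never clears would be requeued repeatedly, so it is worth checking whether your LSF configuration caps the number of requeues.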

If you have control outside of the job script, you can use brequeue -r to requeue a running job.

https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/brequeue.1.html
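A minimal sketch of the external variant, assuming a separate monitoring process already knows the LSF job id. The function name requeue_running_job and the BREQUEUE_CMD override are hypothetical and exist only so the call can be dry-run; the numeric-only check is a simplification that does not cover array job elements.

```shell
#!/bin/bash
# Sketch: requeue a running LSF job from outside the job script.
# BREQUEUE_CMD is a hypothetical override, useful for dry runs.
BREQUEUE_CMD=${BREQUEUE_CMD:-brequeue}

# requeue_running_job JOB_ID
# Validates that JOB_ID is a plain number, then calls "brequeue -r JOB_ID".
requeue_running_job() {
    local job_id=$1
    case "$job_id" in
        ''|*[!0-9]*)
            echo "invalid job id: '$job_id'" >&2
            return 2
            ;;
    esac
    "$BREQUEUE_CMD" -r "$job_id"
}

# Possible use from a monitoring script:
#   requeue_running_job 12345
```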

Good luck.

I managed to get this working by using two additional configuration files. When my grep finds one of the errors, I add the found flow_id to the flow_list.txt configuration file and modify a specially created trigger_file.txt.

I scheduled an additional flow, execute_rerun, in LSF, which is triggered whenever trigger_file.txt is modified. The execute_rerun flow reads the flow_list.txt configuration file line by line and calls jrerun on each flow.
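The two halves of this setup could look roughly like the sketch below. The function names, the file locations and the JRERUN_CMD override are hypothetical placeholders. Skipping duplicate ids and clearing the list after a run are assumptions added here, not details given above.

```shell
#!/bin/bash
# Sketch of the trigger-file approach. Paths and JRERUN_CMD are placeholders.
FLOW_LIST=${FLOW_LIST:-/path/to/flow_list.txt}
TRIGGER_FILE=${TRIGGER_FILE:-/path/to/trigger_file.txt}
JRERUN_CMD=${JRERUN_CMD:-jrerun}

# Called from sasbatch when a retryable error is found in the log.
queue_flow_for_rerun() {
    local flow_id=$1
    # Record each flow id only once (assumption: duplicates are unwanted).
    if ! grep -qxF -- "$flow_id" "$FLOW_LIST" 2>/dev/null; then
        echo "$flow_id" >> "$FLOW_LIST"
    fi
    # Modify the trigger file so the file-triggered flow starts.
    date > "$TRIGGER_FILE"
}

# Body of the execute_rerun flow: rerun every queued flow.
execute_rerun() {
    local pending flow_id
    [ -s "$FLOW_LIST" ] || return 0
    pending=$(mktemp) || return 1
    # Move the list aside so ids appended during this run are kept for the next.
    mv -- "$FLOW_LIST" "$pending"
    while IFS= read -r flow_id; do
        if [ -n "$flow_id" ]; then
            "$JRERUN_CMD" "$flow_id"
        fi
    done < "$pending"
    rm -f -- "$pending"
}
```

With this split, sasbatch only writes to files and exits with its return code, and jrerun is issued later by a different flow, after the failed job has already exited.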

With this I achieved an automatic rerun of the flows that fail due to particular errors.
