LSF - automatic job rerun using sasbatch script
I am trying to create an auto-rerun mechanism by adding some code to the sasbatch script that runs after the SAS command finishes. The general idea is to:
locate the log of the SAS process and the id of the flow containing the current job,
check whether the log contains particular ORA-xxxxx errors for which we know the only fix is to rerun the process,
if so, trigger the jrerun command from the LSF Platform command line interface,
exit sasbatch, passing $rc to LSF.
The idea was implemented as:
# define used paths
log_dir=/path/to/sas_logs_directory
out_log=/path/to/auto-rerun_log.txt
out_log2=/path/to/lsf_rerun_log.txt
if [ -n "${LSB_JOBNAME}" ]; then
    if [ ! -f "$out_log" ]; then
        touch "$out_log"
    fi
    # get flow runtime attributes (LSB_JOBNAME is flow_id:username:flow_name:job_name)
    IFS=: read -r flow_id username flow_name job_name <<< "${LSB_JOBNAME}"
    # find the newest log of the current process
    log_path=$(ls -t "$log_dir"/*.log | xargs grep -li "job:\s*$job_name" | grep -i "/${flow_name}_" | head -1)
    # path to the text file listing the ORA errors we look for, one per line
    conf_path=/path/to/error_list
    # check the log against each listed error
    while read -r line; do
        # if an error is found in the log, try to rerun the flow
        if [ -n "$log_path" ] && grep -q "$line" "$log_path"; then
            (nohup /path/to/rerun_script.sh "$flow_id" > "$out_log2" 2>&1) &
            disown
            break
        fi
    done < "$conf_path"
fi
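The line-by-line loop could probably be collapsed into a single grep call, since grep can read its patterns from a file. The following is only a sketch: the function name log_has_retryable_error is made up for illustration, and it assumes the error list holds one regular expression per line with no blank lines (a blank line would match every log line).

```shell
#!/bin/bash
# Sketch only: log_has_retryable_error is a hypothetical helper name.

# Returns 0 if any pattern from the pattern file occurs in the log,
# non-zero otherwise.
# $1 = path to the SAS log, $2 = file with one error pattern per line
log_has_retryable_error() {
    local log_path=$1 conf_path=$2
    # A missing log or pattern file means there is nothing to match.
    [ -r "$log_path" ] && [ -r "$conf_path" ] || return 1
    # -f reads one pattern per line; -q stops at the first match.
    grep -q -f "$conf_path" -- "$log_path"
}
```

In the script it would be used as: if log_has_retryable_error "$log_path" "$conf_path"; then ... fi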
rerun_script.sh is the script that calls jrerun after a sleep, so that the parent script can exit with $rc in the meantime. It looks like:
# give the parent sasbatch script time to exit
sleep 10
# $1 is the flow id passed in by sasbatch
/some/lsf/path/jrerun "$1"
The problem is that the job keeps running the whole time. In the LSF history I can see that jrerun was called before the job exited. Furthermore, in $out_log2 I can see the message: <flow_id> has no starting or exit points.
Does anyone have an idea how I can pass the return code to LSF before jrerun is called? Or maybe there is some simpler way to perform an automatic rerun of SAS jobs in Platform LSF?
I am using SAS 9.4 and Platform Process Manager 9.1.
"Or maybe some simpler way to perform autorerun of SAS jobs in Platform LSF?"
I'm not knowledgeable about the SAS part. But on the LSF side there are at least a couple of ways to requeue the job.
If you have control of the job script, you can use a special process exit value to automatically requeue the job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_admin/job_requeue_about.html
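As a rough sketch of the exit-value approach: the job script could map "known retryable error found in the log" to a dedicated exit code. The value 99 and the function name choose_exit_code are assumptions for illustration. The code only has an effect if the same value is configured as a requeue exit value on the LSF side (for example via REQUEUE_EXIT_VALUES in lsb.queues, or bsub -Q at submission). The sketch also assumes a rerun is only wanted when SAS itself returned a non-zero code.

```shell
#!/bin/bash
# Assumption: 99 is configured in LSF as a requeue exit value.
REQUEUE_RC=99

# Prints the exit code the job script should finish with.
# $1 = return code of the SAS command
# $2 = path to the SAS log
# $3 = file with one error pattern per line
choose_exit_code() {
    local rc=$1 log_path=$2 conf_path=$3
    # Only ask for a requeue when SAS failed and a listed error is in the log.
    if [ "$rc" -ne 0 ] && grep -q -f "$conf_path" -- "$log_path" 2>/dev/null; then
        echo "$REQUEUE_RC"
    else
        echo "$rc"
    fi
}
```

The last line of the job script would then be something like: exit "$(choose_exit_code "$rc" "$log_path" "$conf_path")". How Platform Process Manager reacts to a job that LSF requeues inside a flow is something I have not verified.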
If you have control outside of the job script, you can use brequeue -r to requeue a running job.
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_command_ref/brequeue.1.html
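A minimal sketch of calling it from a monitoring script might look like this. The wrapper name requeue_running_job and the BREQUEUE_CMD override are hypothetical. The numeric check is a simplification that rejects job array elements such as 123[4].

```shell
#!/bin/bash
# BREQUEUE_CMD can be overridden, e.g. for a dry run.
BREQUEUE_CMD=${BREQUEUE_CMD:-brequeue}

# Requeue a running job by its numeric LSF job id.
requeue_running_job() {
    local job_id=$1
    case $job_id in
        # Reject an empty id or anything containing a non-digit.
        ''|*[!0-9]*) echo "invalid job id: '$job_id'" >&2; return 2 ;;
    esac
    # -r restricts the requeue to jobs that are currently running.
    "$BREQUEUE_CMD" -r "$job_id"
}
```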
Good luck.
I managed to get this working by using two additional configuration files. When my grep finds a match, I add the found flow_id to the flow_list.txt configuration file and modify the specially created trigger_file.txt.
I scheduled an additional flow, execute_rerun, in LSF, which is triggered whenever the file trigger_file.txt is modified. The execute_rerun flow reads the flow_list.txt configuration file line by line and calls jrerun on each flow.
This way I achieved an automatic rerun of flows that fail due to particular errors.
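The two halves of this setup could be sketched roughly as below. It is an illustration, not the exact scripts used. The function names, the de-duplication of flow ids, and moving the list aside before processing are my own assumptions. The paths are placeholders, and jrerun is assumed to take the flow id as its only argument.

```shell
#!/bin/bash
# Placeholder paths and command; override them via the environment.
FLOW_LIST=${FLOW_LIST:-/path/to/flow_list.txt}
TRIGGER_FILE=${TRIGGER_FILE:-/path/to/trigger_file.txt}
JRERUN_CMD=${JRERUN_CMD:-jrerun}

# Producer side: called from sasbatch when a retryable error was found.
queue_flow_for_rerun() {
    local flow_id=$1
    # Record each flow id only once (-x whole line, -F fixed string).
    grep -qxF -- "$flow_id" "$FLOW_LIST" 2>/dev/null || echo "$flow_id" >> "$FLOW_LIST"
    # Modify the trigger file so the execute_rerun flow gets started.
    date > "$TRIGGER_FILE"
}

# Consumer side: the body of the execute_rerun flow.
rerun_queued_flows() {
    # Nothing to do if the list is missing or empty.
    [ -s "$FLOW_LIST" ] || return 0
    local batch="$FLOW_LIST.$$" flow_id
    # Move the list aside so ids queued meanwhile wait for the next trigger.
    mv "$FLOW_LIST" "$batch" || return 1
    # Read via fd 3 so the rerun command cannot consume the list from stdin.
    while read -r flow_id <&3; do
        [ -n "$flow_id" ] && "$JRERUN_CMD" "$flow_id"
    done 3< "$batch"
    rm -f "$batch"
}
```

In sasbatch the background call to rerun_script.sh would be replaced by queue_flow_for_rerun "$flow_id", and the execute_rerun flow would just call rerun_queued_flows.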