
Is it possible to run SLURM jobs in the background using SRUN instead of SBATCH?

I was trying to run SLURM jobs with srun in the background. Unfortunately, because I have to run things through Docker, using sbatch is a bit annoying, so I am trying to find out if I can avoid it altogether.

From my observations, whenever I run srun, say:

srun docker image my_job_script.py

and then close the window where I was running the command (to avoid receiving all the print statements) and open another terminal window to see if the command is still running, it seems that my running script gets cancelled for some reason. Since it isn't going through sbatch, it doesn't write an error log to a file (as far as I know), so I have no idea why it stopped.

I also tried:

srun docker image my_job_script.py &

to give control back to me in the terminal. Unfortunately, if I do that it still keeps printing output to my terminal screen, which I am trying to avoid.

Essentially, I log into a remote computer through ssh and then run an srun command, but it seems that if I terminate the ssh connection, the srun command is automatically killed. Is there a way to stop this?

Ideally I would like to send the script off to run and not have it be cancelled for any reason unless I cancel it through scancel, and it should not print to my screen. So my ideal solution is:

  1. keep running my srun script even if I log out of the ssh session
  2. keep running my srun script even if I close the window from where I sent the command
  3. keep running my srun script without printing to my screen (i.e. essentially run it in the background)

This would be my ideal solution.


For the curious crowd who want to know the issue with sbatch, what I want to be able to do (and which would be the ideal solution) is:

sbatch docker image my_job_script.py

However, as people will know, this does not work because sbatch expects a batch script, and docker is not one. Essentially, a simple solution (that doesn't really work for my case) would be to wrap the docker command in a batch script:

#!/usr/bin/sh
docker image my_job_script.py

Unfortunately, I am actually using my batch script to encode a lot of information about the task I am running (sort of like a config file). So doing that might affect the jobs I run, because their underlying file could change while they are queued. That is avoided by sending the job directly to sbatch, since sbatch essentially creates a copy of the batch script (as noted in this question: Changing the bash script sent to sbatch in slurm during run a bad idea?). So the real solution to my problem would be to have my batch script contain all the information that my script requires and then somehow, in Python, call docker and pass all that information to it. Unfortunately, some of that information consists of function pointers and objects, so it is not even clear to me how I would pass such things to a docker command run from Python.
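For a picture of what I mean, a minimal sketch of the kind of wrapper I am trying to avoid maintaining would be something like this (the config values and the extra arguments to the script are made up for illustration; my real batch script encodes far more than this):

#!/bin/sh
#SBATCH --job-name=my_experiment
#SBATCH --output=my_experiment.out
#SBATCH --error=my_experiment.err

# hypothetical task "config" baked into the wrapper; editing this file while
# jobs are still waiting in the queue is exactly the problem I want to avoid
EXPERIMENT_NAME="my_experiment"
NUM_EPOCHS=10

docker image my_job_script.py --name "$EXPERIMENT_NAME" --epochs "$NUM_EPOCHS"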


Or maybe being able to pass the docker command directly to sbatch, instead of using a batch script, would also solve the problem.

The outputs can be redirected with the options -o for stdout and -e for stderr.

So, the job can be launched in the background with the outputs redirected:

$ srun -o file.out -e file.err docker image my_job_script.py &
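Once it is launched that way, you can keep track of it with the usual SLURM commands (the job id below is a placeholder):

$ squeue -u $USER      # check that the job is still running
$ tail -f file.out     # follow the redirected output when you want to
$ scancel <jobid>      # cancel it explicitly, as asked for in the question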

Another approach is to use a terminal multiplexer like tmux or screen.

For example, create a new tmux session by typing tmux. In that window, run srun with your script. From there, you can then detach the tmux session, which returns you to your main shell so you can go about your other business, or you can log off entirely. When you want to check in on your script, just reattach to the tmux session. See the documentation (tmux -h) for how to detach and reattach on your OS.
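Concretely, the workflow looks something like this (the session name is arbitrary; Ctrl-b then d is the default tmux detach binding):

$ tmux new -s myjob                    # start a named tmux session
$ srun -o file.out -e file.err docker image my_job_script.py
# press Ctrl-b then d to detach; the session and the srun inside it keep running
$ tmux attach -t myjob                 # reattach later to check on it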

Any output redirection using -o or -e will still work with this technique, and you can run multiple srun commands concurrently in different tmux windows. I've found this approach useful, especially when developing a script that takes hours to run.

I was wondering this too, because the differences between sbatch and srun are not very clearly explained or motivated. I looked at the code and found:

sbatch

sbatch pretty much just sends a shell script to the controller, tells it to run it and then exits. It does not need to keep running while the job is happening. It does have a --wait option to stay running until the job is finished, but all that does is poll the controller every 2 seconds to ask about it.
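As a small illustration (the script name is just a placeholder):

$ sbatch my_job.sh          # submits, prints "Submitted batch job <jobid>" and returns immediately
$ sbatch --wait my_job.sh   # submits, but only returns once the job has finished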

sbatch can't run a job across multiple nodes - the code simply isn't in sbatch.c. sbatch is not implemented in terms of srun; it's a totally different thing.

Also, its argument must be a shell script. That is a bit of a weird limitation, but it does have a --wrap option so that it can automatically wrap a real program in a shell script for you. Good luck getting all the escaping right with that!
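For the docker command from the question, that would look roughly like this (the command string is taken verbatim from the question; whether it behaves well inside a batch allocation is a separate issue):

$ sbatch -o file.out -e file.err --wrap "docker image my_job_script.py"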

srun

srun is more like an MPI runner. It directly starts tasks on lots of nodes (one task per node by default, though you can override that with --ntasks). It's intended for MPI, so all of the tasks run simultaneously. It won't start any of them until all the nodes have a slot free.
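A quick way to see that behaviour, using hostname as a stand-in for a real task:

$ srun --nodes=2 --ntasks=4 hostname   # starts 4 copies of hostname across 2 nodes, all at the same time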

It must keep running while the job is in progress. You can send it to the background with &, but this is still different from sbatch. If you need to start a million sruns, you're going to have a problem. A million sbatches should (in theory) work fine.

There is no way to have srun exit and leave the job still running, like there is with sbatch. srun itself acts as a coordinator for all of the nodes in the job, and it updates the job status etc., so it needs to be running for the whole thing.
