如何使用 Bash 脚本保持所有 GPU 设备运行任务？

Question

I have an 8-GPU server and I'd like to train one neural network on each of them simultaneously.我有一个 8-GPU 服务器，我想同时在每个服务器上训练一个神经网络。 I have several tens of such networks to train and I would like to schedule the training task.我有几十个这样的网络要训练，我想安排训练任务。 Currently I'm writing my own bash script for this scheduling task.目前我正在为此调度任务编写自己的 bash 脚本。

for l1 in {1e-4,2e-4,5e-4,1e-3}; do

      python train.py --lr $l1 --attr 0 --device 0 &
      python train.py --lr $l1 --attr 1 --device 1 &
      python train.py --lr $l1 --attr 2 --device 2 &
      python train.py --lr $l1 --attr 3 --device 3 &
      python train.py --lr $l1 --attr 4 --device 4 &
      python train.py --lr $l1 --attr 5 --device 5 &
      python train.py --lr $l1 --attr 6 --device 6 &
      python train.py --lr $l1 --attr 7 --device 7

      sleep 1
      wait 
done

In the above script, --device flag chooses the GPU to use while other flags just determine the hyper-parameters of my deep neural networks.在上面的脚本中，--device 标志选择--device使用，而其他标志只是确定我的深度神经网络的超参数。 What this script does is that, for each iteration of the for-loop, it launches one training task on each GPU and wait for all of them to finish before starting the next iteration.该脚本的作用是，对于 for 循环的每次迭代，它会在每个 GPU 上启动一个训练任务，并等待所有这些任务完成，然后再开始下一次迭代。 The issue is that, each of the training task may take different time to run, thus there will be a significant amount of time that I'm using less than 8 GPUs simultaneously, which lengthens the time for whole task to finish.问题是，每个训练任务可能需要不同的时间来运行，因此我会同时使用少于 8 个 GPU 的大量时间，这会延长整个任务完成的时间。

I'm wondering whether there is some way for me to detect which GPU has finished its task and launch a new task on it, so that I can always have 8 GPUs running.我想知道是否有某种方法可以让我检测到哪个 GPU 已完成其任务并在其上启动一个新任务，这样我就可以始终运行 8 个 GPU。

Thanks a lot!非常感谢！

Answer 1

I saw you are not using a cluster, that means the GPUs are on your local machine.我看到您没有使用集群，这意味着 GPU 在您的本地计算机上。

In this case, you can use this library: https://pypi.org/project/simple-gpu-scheduler/在这种情况下，您可以使用这个库： https://pypi.org/project/simple-gpu-scheduler/

Hope this helps.希望这可以帮助。

Answer 2

Just found that Ray is a great package for managing your experiment.刚刚发现 Ray 是一款出色的 package，可用于管理您的实验。 ( https://github.com/ray-project/ray ) （ https://github.com/ray-project/ray ）

如何使用 Bash 脚本保持所有 GPU 设备运行任务？

问题描述

2 个解决方案

解决方案1
1 已采纳 2019-10-30 15:56:55

解决方案2
0 2019-11-01 15:36:27

如何使用 Bash 脚本保持所有 GPU 设备运行任务？

问题描述

2 个解决方案

解决方案1 1 已采纳 2019-10-30 15:56:55

解决方案2 0 2019-11-01 15:36:27

解决方案1
1 已采纳 2019-10-30 15:56:55

解决方案2
0 2019-11-01 15:36:27