
How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory. Thus, I could run multiple ML jobs on a single GPU.

To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device. However, it seems the slurm daemon doesn't accept this, the service returning:

fatal: Gres GPU plugin failed to load configuration
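For context, the kind of gres.conf described above would look something like this (a sketch only; the device path is an assumption, and slurmd rejects such duplicate entries with the error shown):

```
# gres.conf — attempting to expose one physical GPU as two gres entries
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia0
```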

Is there any option I'm missing to make this work?

Or maybe a different way to achieve that with SLURM?

It is kind of similar to this one, but that one seems specific to some CUDA code with compilation enabled, which seems way more specific than my general case (or at least as far as I understand): How to run multiple jobs on a GPU grid with CUDA using SLURM

I don't think you can oversubscribe GPUs, so I see two options:

  1. You can configure the CUDA Multi-Process Service, or
  2. pack multiple calculations into a single job that has one GPU and run them in parallel.
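Option 2 above can be sketched as a single batch script (script names and configs are placeholders):

```
#!/bin/bash
#SBATCH --job-name=ml-pack
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

# Launch several light ML runs on the one allocated GPU,
# backgrounding each and waiting for all of them to finish.
python train.py --config cfg1.yaml &
python train.py --config cfg2.yaml &
python train.py --config cfg3.yaml &
wait
```

All processes in the job see the same GPU, so this works as long as their combined memory use fits on the card.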

Besides NVIDIA MPS mentioned by @Marcus Boden, which is relevant for V100 types of cards, there also is Multi-Instance GPU (MIG), which is relevant for A100 types of cards.
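As a rough sketch, enabling each of these looks like the following (administrator commands; the MIG profile ID depends on the card model):

```
# CUDA MPS: start the control daemon, then launch jobs as usual;
# their kernels are multiplexed onto the GPU.
nvidia-cuda-mps-control -d
# ... run jobs ...
echo quit | nvidia-cuda-mps-control   # stop MPS

# MIG (A100 and newer): enable MIG mode on GPU 0 and carve it
# into instances (profile 19 = 1g.5gb on an A100-40GB, for example).
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 19,19 -C
```

With MIG, each instance appears as a separate device, which SLURM can then list as its own gres entry in gres.conf.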

