
How do I get Keras to train a model on a specific GPU?

There is a shared server with 2 GPUs in my institution. Suppose two team members each want to train a model at the same time. How do they get Keras to train their models on specific GPUs so as to avoid a resource conflict?

Ideally, Keras should figure out which GPU is currently busy training a model and then use the other GPU to train the other model. However, this doesn't seem to be the case. It seems that by default Keras only uses the first GPU (since the Volatile GPU-Util of the second GPU is always 0%).


Possibly a duplicate of my previous question.

It's a bit more complicated. Keras will use the memory on both GPUs, although it will only use one GPU for computation by default. See keras.utils.multi_gpu_model for using several GPUs.

I found the solution: choose the GPU with the environment variable CUDA_VISIBLE_DEVICES.

You can set this manually before importing keras or tensorflow to choose your GPU:

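import os

# Use exactly one of these lines; the alternatives are left commented out.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"     # first GPU
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # second GPU
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide all GPUs; runs on CPU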

To do this automatically, I wrote a function that parses the nvidia-smi output, detects which GPU is already in use, and sets the variable accordingly.
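The function itself is not shown in the answer. A minimal sketch of the idea might look like the following; the helper names (parse_gpu_memory, pick_least_used_gpu, select_free_gpu) are hypothetical, and treating "least memory used" as "free" is an assumption rather than the answerer's actual rule. It also sets CUDA_DEVICE_ORDER=PCI_BUS_ID so that CUDA numbers the devices in the same order as nvidia-smi.

```python
import os
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def pick_least_used_gpu(usage):
    """Return the index of the GPU with the least memory in use.

    Assumption: low memory use means the GPU is idle. Ties go to the
    first GPU listed.
    """
    return min(usage, key=usage.get)


def select_free_gpu():
    """Query nvidia-smi and expose only the least-used GPU to this process.

    Must run before keras or tensorflow is imported.
    """
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    gpu = pick_least_used_gpu(parse_gpu_memory(output))
    # Make CUDA enumerate devices in the same order as nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return gpu
```

Call select_free_gpu() at the top of the training script, before the keras import. Two users who start at the same instant can still pick the same GPU, because the check and the allocation are not atomic.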

If you are using a training script, you can simply set the variable on the command line before invoking the script:

CUDA_VISIBLE_DEVICES=1 python train.py 
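The same per-process setting can also be applied from a Python launcher, for example to start one job per GPU. This is only a sketch: env_for_gpu and launch_on_gpu are hypothetical helper names, and train.py in the usage note is the script name from the command above.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_gpu(command, gpu_index):
    """Start `command` as a child process restricted to the given GPU."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))
```

For example, launch_on_gpu(["python", "train.py"], 0) and launch_on_gpu(["python", "train.py"], 1) would start two runs, each seeing a different GPU; call .wait() on the returned processes to block until they finish.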

If you want to train models on cloud GPUs (e.g. GPU instances from AWS), try this library:

!pip install aibro==0.0.45 --extra-index-url https://test.pypi.org/simple

from aibro.train import fit
machine_id = 'g4dn.4xlarge' #instance name on AWS
job_id, trained_model, history = fit(
    model=model,
    train_X=train_X,
    train_Y=train_Y,
    validation_data=(validation_X, validation_Y),
    machine_id=machine_id
)

Tutorial: https://colab.research.google.com/drive/19sXZ4kbic681zqEsrl_CZfB5cegUwuIB#scrollTo=ERqoHEaamR1Y
