
ValueError when using multi_gpu_model in Keras

I am using a Google Cloud VM with 4 Tesla K80 GPUs.

I am running a Keras model using multi_gpu_model with gpus=4 (since I have 4 GPUs), but I am getting the following error:

ValueError: To call multi_gpu_model with gpus=4, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/gpu:0']. Try reducing gpus.

I can see that there are only two GPU entries here, namely '/xla_gpu:0' and '/gpu:0', so I tried with gpus=2 and again got the following error:

ValueError: To call multi_gpu_model with gpus=2, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/gpu:0']. Try reducing gpus.

Can anyone help me out with this error? Thanks!

It looks like Keras only sees one of the GPUs.

Make sure that all 4 GPUs are accessible; you can check with device_lib from TensorFlow:

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

You might need to manually install or update the GPU drivers on your instance.

TensorFlow is only seeing one GPU (the gpu and xla_gpu devices are two backends over the same physical device). Are you setting CUDA_VISIBLE_DEVICES? Does nvidia-smi show all GPUs?
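If CUDA_VISIBLE_DEVICES is the culprit, note that it must be set before TensorFlow is imported; a sketch (the device IDs '0,1,2,3' assume the numbering reported by nvidia-smi):

```python
import os

# An empty or truncated CUDA_VISIBLE_DEVICES hides GPUs from the process.
# Expose all four devices; this has no effect once TensorFlow is loaded.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

print(os.environ['CUDA_VISIBLE_DEVICES'])

# Only import TensorFlow *after* the variable is set:
# import tensorflow as tf
```

Alternatively, set the variable in the shell before launching Python, which avoids any ordering issue inside the script.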

I had the same issue and I think I figured out a way around it. In my case, I am working on an HPC cluster where I installed Keras in my /.local, whereas TensorFlow and CUDA were installed by the IT staff; I encountered the same error as above. I am using Tensorflow==1.15.0 and Keras==2.3.1.

I noticed that the error message:

ValueError: To call multi_gpu_model with gpus=2, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/xla_gpu:1']. Try reducing gpus.

comes from the following Keras file, at line 184:

/home/.local/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py

I solved this by replacing line 175 with the following:

target_devices = ['/cpu:0'] + ['/gpu:%d' % i for i in target_gpu_ids]      # before
target_devices = ['/cpu:0'] + ['/xla_gpu:%d' % i for i in target_gpu_ids]  # after

Moreover, I modified the following Keras file:

/home/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py

replacing line 510 with:

return [x for x in _LOCAL_DEVICES if 'device:gpu' in x.lower()]  # before
return [x for x in _LOCAL_DEVICES if 'device:XLA_GPU' in x]      # after

Long story short, this appears to be a bug in Keras rather than a problem with the environment setup. After these modifications my network was able to run on the xla_gpus. I hope this is somehow helpful.
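The second change matters because the original filter never matches XLA devices. A stdlib-only illustration, using device-name strings mimicking what _LOCAL_DEVICES holds on the HPC described above:

```python
# Device names as seen on a machine exposing only XLA GPU devices.
local_devices = [
    '/device:CPU:0',
    '/device:XLA_CPU:0',
    '/device:XLA_GPU:0',
    '/device:XLA_GPU:1',
]

# Original Keras filter: looks for the substring 'device:gpu', which
# 'device:xla_gpu' (after lower()) does not contain -> no GPUs found.
before = [x for x in local_devices if 'device:gpu' in x.lower()]

# Patched filter: matches the XLA device names directly.
after = [x for x in local_devices if 'device:XLA_GPU' in x]

print(before)  # []
print(after)   # ['/device:XLA_GPU:0', '/device:XLA_GPU:1']
```

This is why multi_gpu_model reported zero usable GPUs even though two XLA GPU devices were listed.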

You can check the full device list using the following code:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

This can be caused by having tensorflow installed instead of tensorflow-gpu.

One way to fix this is the following:

$ pip uninstall tensorflow
$ pip install tensorflow-gpu

More information can be found here: https://stackoverflow.com/a/42652258/6543020

I had the same issue: Tensorflow-gpu 1.14 installed, CUDA 10.0, and device_lib.list_local_devices() showed 4 XLA_GPU devices.

I have another conda environment with just Tensorflow 1.14 installed and no tensorflow-gpu. I don't know why, but with that environment I can run my multi_gpu model on all GPUs.
