
ValueError when using multi_gpu_model in Keras

I am using a Google Cloud VM with 4 Tesla K80 GPUs.

I am running a Keras model using multi_gpu_model with gpus=4 (since I have 4 GPUs), but I am getting the following error:

ValueError: To call multi_gpu_model with gpus=4, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/gpu:0']. Try reducing gpus.

I can see that there are only two GPU entries here, namely '/xla_gpu:0' and '/gpu:0', so I tried with gpus=2 and again got the following error:

ValueError: To call multi_gpu_model with gpus=2, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/gpu:0']. Try reducing gpus.

Can anyone help me out with this error? Thanks!

It looks like Keras only sees one of the GPUs.

Make sure that all 4 GPUs are accessible; you can check with device_lib from TensorFlow:

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

You might need to manually install or update the GPU drivers on your instance.

TensorFlow is only seeing one GPU (the gpu and xla_gpu devices are two backends over the same physical device). Are you setting CUDA_VISIBLE_DEVICES? Does nvidia-smi show all GPUs?
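If CUDA_VISIBLE_DEVICES is the culprit, note that it must be set before TensorFlow is imported; a sketch (the device IDs '0,1,2,3' assume the numbering reported by nvidia-smi):

```python
import os

# An empty or truncated CUDA_VISIBLE_DEVICES hides GPUs from the process.
# Expose all four devices; this has no effect once TensorFlow is loaded.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

print(os.environ['CUDA_VISIBLE_DEVICES'])

# Only import TensorFlow *after* the variable is set:
# import tensorflow as tf
```

Alternatively, set the variable in the shell before launching Python, which avoids any ordering issue inside the script.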

I had the same issue and I think I figured out a way around it. In my case, I am working on an HPC cluster where I installed Keras in my /.local, whereas TensorFlow and CUDA were installed by the IT staff; I encountered the same error as above. I am using Tensorflow==1.15.0 and Keras==2.3.1.

I noticed that the error message:

ValueError: To call multi_gpu_model with gpus=2, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/xla_cpu:0', '/xla_gpu:0', '/xla_gpu:1']. Try reducing gpus.

comes from the following Keras file, at line 184:

/home/.local/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py

I solved this by replacing line 175 with the following:

target_devices = ['/cpu:0'] + ['/gpu:%d' % i for i in target_gpu_ids]      # before
target_devices = ['/cpu:0'] + ['/xla_gpu:%d' % i for i in target_gpu_ids]  # after

Moreover, I modified the following Keras file:

/home/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py

replacing line 510 with:

return [x for x in _LOCAL_DEVICES if 'device:gpu' in x.lower()]  # before
return [x for x in _LOCAL_DEVICES if 'device:XLA_GPU' in x]      # after

Long story short, this appears to be a bug in Keras rather than a problem with the environment setup. After these modifications my network was able to run on the xla_gpus. I hope this is somehow helpful.
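The second change matters because the original filter never matches XLA devices. A stdlib-only illustration, using device-name strings mimicking what _LOCAL_DEVICES holds on the HPC described above:

```python
# Device names as seen on a machine exposing only XLA GPU devices.
local_devices = [
    '/device:CPU:0',
    '/device:XLA_CPU:0',
    '/device:XLA_GPU:0',
    '/device:XLA_GPU:1',
]

# Original Keras filter: looks for the substring 'device:gpu', which
# 'device:xla_gpu' (after lower()) does not contain -> no GPUs found.
before = [x for x in local_devices if 'device:gpu' in x.lower()]

# Patched filter: matches the XLA device names directly.
after = [x for x in local_devices if 'device:XLA_GPU' in x]

print(before)  # []
print(after)   # ['/device:XLA_GPU:0', '/device:XLA_GPU:1']
```

This is why multi_gpu_model reported zero usable GPUs even though two XLA GPU devices were listed.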

You can check the full device list using the following code:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

This can be caused by having tensorflow installed instead of tensorflow-gpu.

One way to fix this is the following:

$ pip uninstall tensorflow
$ pip install tensorflow-gpu

More information can be found here: https://stackoverflow.com/a/42652258/6543020

I had the same issue: Tensorflow-gpu 1.14 installed, CUDA 10.0, and device_lib.list_local_devices() showed 4 XLA_GPU devices.

I have another conda environment with just Tensorflow 1.14 installed and no tensorflow-gpu. I don't know why, but with that environment I can run my multi_gpu model on all GPUs.
