Tensorflow / Keras multi_gpu_model does not split training across multiple GPUs

Question

I've run into a problem: I cannot split my training batches across more than one GPU. When multi_gpu_model from tensorflow.keras.utils is used, TensorFlow allocates the full memory on all available GPUs (for example 2), but only the first one (gpu[0]) shows 100% utilization when watching nvidia-smi.

I'm using TensorFlow 1.12 right now.

Test on single device

model = getSimpleCNN(... some parameters)

model.compile()
model.fit()

As expected, data is loaded by the CPU and the model runs on gpu[0] with 97% - 100% GPU utilization.

Create a multi_gpu model

As described in the TensorFlow API documentation for multi_gpu_model, the device scope for the model definition is not changed.

from tensorflow.keras.utils import multi_gpu_model

model = getSimpleCNN(... some parameters)
parallel_model = multi_gpu_model(model, gpus=2, cpu_merge=False)  # weights merge on GPU (recommended for NV-link)

parallel_model.compile()
parallel_model.fit()
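For context, the Keras documentation describes multi_gpu_model as single-machine data parallelism: each input batch is divided into sub-batches, a model replica runs on each sub-batch, and the outputs are concatenated into one batch again. The plain-Python sketch below only illustrates that slicing logic. The names split_batch and merge_outputs are hypothetical helpers made up for this illustration; they are not part of Keras or TensorFlow.

```python
def split_batch(batch, n_gpus):
    """Slice one global batch into n_gpus contiguous sub-batches.

    The last slice absorbs any remainder, so no sample is dropped.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    step = len(batch) // n_gpus
    slices = []
    for i in range(n_gpus):
        start = i * step
        # Every replica gets `step` samples; the last one takes the rest.
        stop = len(batch) if i == n_gpus - 1 else start + step
        slices.append(batch[start:stop])
    return slices


def merge_outputs(per_gpu_outputs):
    """Concatenate per-replica outputs back into one batch, in order."""
    merged = []
    for outputs in per_gpu_outputs:
        merged.extend(outputs)
    return merged


global_batch = list(range(10))           # stand-in for 10 samples
sub_batches = split_batch(global_batch, n_gpus=4)
print([len(s) for s in sub_batches])     # [2, 2, 2, 4]
```

One consequence of this scheme is that the batch size passed to fit() is the global batch, so with 2 GPUs and a batch size of 64 each replica would process 32 samples. If the global batch is small, each GPU may receive too little work to show sustained utilization. That is a possible contributing factor worth checking, not a confirmed diagnosis of the behavior described here.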

As seen in the timeline, the CPU now not only loads the data but also does some other calculations. Notice that the second GPU is doing almost nothing.

The question

The effect gets even worse as soon as four GPUs are used. Utilization of the first one goes up to 100%, but the rest show only short peaks.
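To put numbers on the per-GPU utilization instead of watching nvidia-smi by eye, something like the following sketch could be used. It assumes the nvidia-smi CSV query mode (--query-gpu with --format=csv,noheader,nounits) and integer values in every field. The helper names parse_gpu_stats, idle_gpus and read_gpu_stats are hypothetical, and the sample string is made up to resemble the two-GPU situation above; it is not measured output.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(csv_text):
    """Parse 'index, utilization, memory' CSV lines into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append({"gpu": int(index),
                      "util_percent": int(util),
                      "memory_mib": int(mem)})
    return stats


def idle_gpus(stats, threshold=10):
    """Return indices of GPUs whose utilization is below `threshold` percent."""
    return [s["gpu"] for s in stats if s["util_percent"] < threshold]


def read_gpu_stats():
    """Run nvidia-smi once and parse its output (needs an NVIDIA driver)."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_stats(output)


# Made-up sample: full memory on both GPUs, but only gpu 0 is busy.
sample = "0, 100, 10970\n1, 3, 10970\n"
print(idle_gpus(parse_gpu_stats(sample)))  # [1]
```

Calling read_gpu_stats() periodically during training (for example once per second) would show whether the other GPUs are idle throughout or only between short bursts.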

Is there any solution to fix this? How do I properly train on multiple GPUs?

Is there any difference between tensorflow.keras.utils and keras.utils that causes the unexpected behavior?

Answer 1

I just ran into the same issue. In my case, the problem came from using a build_model(... parameters) function that returned the model. Be careful with your getSimpleCNN() function. Since I don't know what is in it, my best advice is to build the model sequentially in your code without using this function.

Answer 1 (0, 2019-04-26 14:26:13)
