
Tensorflow / keras multi_gpu_model is not split across more than one GPU

I've run into a problem: I cannot successfully split my training batches across more than one GPU. If multi_gpu_model from tensorflow.keras.utils is used, TensorFlow allocates the full memory on all available GPUs (for example, 2), but only the first one (gpu[0]) is utilized to 100% when watching nvidia-smi.
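To make the "only gpu[0] is busy" observation reproducible rather than eyeballed, the nvidia-smi readings can be captured in a script. The following is a minimal sketch, not part of the original post: the helper names (parse_gpu_stats, idle_gpus, read_gpu_stats), the 10% threshold, and the sample readings are all hypothetical. It assumes nvidia-smi is on the PATH and reports plain integers for each queried field.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, utilization (%), memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV text produced by QUERY into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def idle_gpus(stats, threshold_pct=10):
    """Return the indices of GPUs whose utilization is below the (hypothetical) threshold."""
    return [s["index"] for s in stats if s["util_pct"] < threshold_pct]


def read_gpu_stats():
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


# Hypothetical readings resembling the situation described: both GPUs have
# memory allocated, but only the first one is doing work.
sample = "0, 100, 10871\n1, 3, 10871\n"
print(idle_gpus(parse_gpu_stats(sample)))
```

On a machine with GPUs, calling idle_gpus(read_gpu_stats()) periodically during training would show whether the second device ever leaves the idle list.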

I'm using tensorflow 1.12 right now.

Test on a single device

model = getSimpleCNN(... some parameters)

model.compile()
model.fit()

As expected, the data is loaded by the CPU and the model runs on gpu[0] with 97% - 100% GPU utilization.

Create a multi_gpu model

As described in the TensorFlow API documentation for multi_gpu_model, the device scope for the model definition is not changed.

from tensorflow.keras.utils import multi_gpu_model

model = getSimpleCNN(... some parameters)
parallel_model = multi_gpu_model(model, gpus=2, cpu_merge=False)  # weights merge on GPU (recommended for NV-link)

parallel_model.compile()
parallel_model.fit()
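For context on what "splitting" means here: multi_gpu_model divides each input batch into one sub-batch per GPU, runs a model replica on each, and concatenates the results, so the batch size passed to fit() is shared among the replicas. The plain-Python sketch below mirrors that slicing arithmetic as I understand it from the Keras implementation. The function name sub_batch_slices is hypothetical, and the code only illustrates the index math; it does not touch TensorFlow.

```python
def sub_batch_slices(batch_size, gpus):
    """Return a (start, size) pair for each replica's slice of one batch."""
    step = batch_size // gpus
    slices = []
    for i in range(gpus):
        start = step * i
        # The last replica also takes the remainder when the batch
        # does not divide evenly by the number of GPUs.
        size = batch_size - start if i == gpus - 1 else step
        slices.append((start, size))
    return slices


# A batch of 64 on 2 GPUs: each replica sees 32 samples per step.
print(sub_batch_slices(64, 2))   # [(0, 32), (32, 32)]

# An uneven split: the last replica receives the leftover samples.
print(sub_batch_slices(10, 4))   # [(0, 2), (2, 2), (4, 2), (6, 4)]
```

One consequence worth keeping in mind: if the batch size is left at its single-GPU value, each replica processes only a fraction of it per step, so the per-GPU workload shrinks as GPUs are added.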

As seen in the timeline, the CPU now not only loads the data but also performs some other calculations. Note that the second GPU is doing almost nothing.

The question

The effect gets even worse as soon as four GPUs are used. Utilization of the first one goes up to 100%, but the rest show only short peaks.

Is there any way to fix this? How do I properly train on multiple GPUs?

Is there any difference between tensorflow.keras.utils and keras.utils that causes the unexpected behavior?

I just ran into the same issue. In my case, the problem came from the use of a build_model(... parameters) function that returned the model. Be careful with your getSimpleCNN() function. Since I don't know what is in it, my best advice is to build the model sequentially in your code without using this function.


 