TensorFlow / Keras multi_gpu_model does not split training across more than one GPU
I've run into a problem: I cannot successfully split my training batches across more than one GPU. If multi_gpu_model from tensorflow.keras.utils is used, TensorFlow allocates the full memory on all available (for example, 2) GPUs, but only the first one (gpu[0]) is utilized at 100% when watching nvidia-smi.
I'm using TensorFlow 1.12 right now.
model = getSimpleCNN(... some parameters)
model.compile()
model.fit()
As expected, the data is loaded by the CPU and the model runs on gpu[0] with 97%-100% GPU utilization:
As described in the TensorFlow API documentation for multi_gpu_model, the device scope for the model definition is not changed.
from tensorflow.keras.utils import multi_gpu_model
model = getSimpleCNN(... some parameters)
parallel_model = multi_gpu_model(model, gpus=2, cpu_merge=False) # weights merge on GPU (recommended for NV-link)
parallel_model.compile()
parallel_model.fit()
As seen in the timeline, the CPU now not only loads the data but also performs some other calculations. Note that the second GPU is doing almost nothing:
The effect gets even worse as soon as four GPUs are used. Utilization of the first one goes up to 100%, but the rest show only short peaks.
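For reference, the Keras documentation describes multi_gpu_model as dividing each input batch into sub-batches, applying a model copy to each sub-batch on its own GPU, and concatenating the results into one big batch. The following is only a plain-Python sketch of that slicing idea, not the library's implementation: split_batch is a hypothetical helper, and giving the remainder to the last replica is an assumption of this sketch.

```python
def split_batch(batch, n_gpus):
    """Hypothetical helper: slice one batch into n_gpus contiguous sub-batches."""
    step = len(batch) // n_gpus
    slices = []
    for i in range(n_gpus):
        start = i * step
        # Assumption of this sketch: the last replica also takes any remainder.
        end = len(batch) if i == n_gpus - 1 else start + step
        slices.append(batch[start:end])
    return slices


# A global batch of 64 samples split across 2 replicas.
batch = list(range(64))
sub_batches = split_batch(batch, n_gpus=2)
sizes = [len(s) for s in sub_batches]

# Concatenating the per-replica slices restores the original batch order.
merged = [x for s in sub_batches for x in s]

print(sizes)
```

With a batch of 64 and two replicas, each replica sees 32 samples per step. The batch size passed to fit() is therefore the global one, which may be worth keeping in mind when comparing per-GPU utilization.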
Is there any solution to fix this? How do I properly train on multiple GPUs?
Is there any difference between tensorflow.keras.utils and keras.utils that causes the unexpected behavior?
I just ran into the same issue. In my case, the problem came from using a build_model(... parameters) function that returned the model. Be careful with your getSimpleCNN() function: since I don't know what is in it, my best advice is to build the model sequentially in your code without using this function.