Why is my training speed in Keras with multi_gpu_model worse than with a single GPU?
My Keras version is 2.0.9, and I am using the TensorFlow backend.
I tried to use multi_gpu_model in Keras. However, in practice, training with 4 GPUs was slower than with 1 GPU. It took 25 seconds with 1 GPU and 50 seconds with 4 GPUs. Could you tell me why this happens?
Blog post for multi_gpu_model:
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
I used this command for 1 GPU:
CUDA_VISIBLE_DEVICES=0 python gpu_test.py
and for 4 GPUs:
python gpu_test.py
Here is the source code for training:
from keras.datasets import mnist
from keras.layers import Input, Dense, merge
from keras.layers.core import Lambda
from keras.models import Model
from keras.utils import to_categorical
from keras.utils.training_utils import multi_gpu_model
import time
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
inputs = Input(shape=(784,))
x = Dense(4096, activation='relu')(inputs)
x = Dense(2048, activation='relu')(x)
x = Dense(512, activation='relu')(x)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)
'''
m_model = multi_gpu_model(model, 4)
m_model.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
m_model.summary()
a=time.time()
m_model.fit(x_train, y_train, batch_size=128, epochs=5)
print time.time() - a
a=time.time()
m_model.predict(x=x_test, batch_size=128)
print time.time() - a
'''
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
a=time.time()
model.fit(x_train, y_train, batch_size=128, epochs=5)
print time.time() - a
a=time.time()
model.predict(x=x_test, batch_size=128)
print time.time() - a
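For reference, the batch arithmetic of the script above can be sketched as follows. This assumes that multi_gpu_model divides each incoming batch into sub-batches, one per GPU, as its documentation describes. `batch_plan` is a hypothetical helper written for this illustration and is not part of Keras. The sketch only shows how little work each GPU receives per step with batch_size=128; whether that accounts for the slowdown reported here is not established.

```python
import math


def batch_plan(n_samples, batch_size, n_gpus):
    """Return (steps_per_epoch, per_gpu_sizes) for one batch split across GPUs.

    Hypothetical helper for illustration; it mirrors an even split in which
    the last replica takes whatever remains.
    """
    steps_per_epoch = int(math.ceil(n_samples / float(batch_size)))
    base = batch_size // n_gpus
    sizes = [base] * (n_gpus - 1) + [batch_size - base * (n_gpus - 1)]
    return steps_per_epoch, sizes


# Numbers taken from the script above: 60000 MNIST samples, batch_size=128.
steps, sizes = batch_plan(60000, 128, 4)
print(steps, sizes)  # prints: 469 [32, 32, 32, 32]
```

With 4 GPUs, each replica processes only 32 samples per step, while the number of steps per epoch stays at 469, the same as on a single GPU.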
I can give you what I think is the answer, but I don't have it fully working myself. A bug report tipped me off to this, and the source code for multi_gpu_model says:
# Instantiate the base model (or "template" model).
# We recommend doing this with under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
I think this is the problem. I'm still working on getting it to work myself, though.
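Applied to the Dense network from the question, the recommendation in that source comment might look like the sketch below. It is a minimal, untested sketch assuming Keras 2.0.9 with the TensorFlow backend and 4 visible GPUs. `build_models` is a hypothetical helper name, and nothing here confirms that this change removes the slowdown.

```python
def build_models(n_gpus=4):
    """Build the template model on the CPU and wrap it for n_gpus GPUs.

    The imports live inside the function so that this file can be loaded on
    a machine without Keras/TensorFlow; calling the function needs both,
    plus n_gpus visible GPUs.
    """
    import tensorflow as tf
    from keras.layers import Input, Dense
    from keras.models import Model
    from keras.utils.training_utils import multi_gpu_model

    # Host the template model's weights in CPU memory, as the comment in
    # the multi_gpu_model source recommends.
    with tf.device('/cpu:0'):
        inputs = Input(shape=(784,))
        x = Dense(4096, activation='relu')(inputs)
        x = Dense(2048, activation='relu')(x)
        x = Dense(512, activation='relu')(x)
        x = Dense(64, activation='relu')(x)
        predictions = Dense(10, activation='softmax')(x)
        model = Model(inputs=inputs, outputs=predictions)

    # Replicate the template across the GPUs and compile the parallel model.
    m_model = multi_gpu_model(model, n_gpus)
    m_model.compile(optimizer='rmsprop',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
    return model, m_model
```

Training would then call `fit` on the returned `m_model`, exactly as in the question's script, while the template `model` keeps the shared weights.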