Error on prediction running keras multi_gpu_model
I have an issue running a Keras model on a Google Cloud Platform instance.
The model is the following:
n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
verbose, epochs, batch_size = 1, 1, 64 # low number of epochs just for testing purpose
with tf.device('/cpu:0'):
    m = Sequential()
    m.add(CuDNNLSTM(20, input_shape=(n_timesteps, n_features)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(RepeatVector(n_outputs))
    m.add(CuDNNLSTM(20, return_sequences=True))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(20)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(1)))
self.model = multi_gpu_model(m, gpus=8)
self.model.compile(loss='mse', optimizer='adam')
self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
As you can see from the code above, I run the model on a machine with 8 GPUs (Nvidia Tesla K80).
Training works well, without any errors. However, the prediction fails and returns the following error:
W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336: Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
Here is the code to run the prediction:
self.model.predict(input_x)
What I've noticed is that if I remove the code for multi-GPU data parallelism, the code works well using a single GPU.
To be more precise, if I comment out this line, the code works without error:
self.model = multi_gpu_model(m, gpus=8)
What am I missing?
virtualenv information
cudatoolkit - 10.0.130
cudnn - 7.6.4
keras - 2.2.4
keras-applications - 1.0.8
keras-base - 2.2.4
keras-gpu - 2.2.4
python - 3.6
UPDATE
train_x.shape = (1441, 288, 1)
train_y.shape = (1441, 288, 1)
input_x.shape = (1, 288, 1)
After Olivier Dehaene's reply I tried his suggestion and it worked.
I modified the input_x shape to obtain (8, 288, 1).
To do that, I also modified the train_x and train_y shapes.
Here is a recap:
train_x.shape = (8065, 288, 1)
train_y.shape = (8065, 288, 1)
input_x.shape = (8, 288, 1)
But now I get the same error in the training phase, on this line:
self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
From the tf.keras.utils.multi_gpu_model documentation, we can see that it works in the following way:
- Divide the model's input(s) into multiple sub-batches.
- Apply a model copy on each sub-batch. Every model copy is executed on a dedicated GPU.
- Concatenate the results (on CPU) into one big batch.
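The three steps above can be mimicked with plain NumPy to see what each model copy receives. This is only a sketch: split_batch and run_replicas are hypothetical helper names, and the slicing rule (equal integer-sized slices, with the last copy taking the remainder) is an assumption about how the split behaves, not code taken from Keras.

```python
import numpy as np

def split_batch(batch, n_gpus):
    """Slice a batch into n_gpus consecutive sub-batches.

    Assumption: every copy gets batch_size // n_gpus samples and the
    last copy also takes whatever remains.
    """
    step = batch.shape[0] // n_gpus
    slices = []
    for i in range(n_gpus):
        start = i * step
        stop = batch.shape[0] if i == n_gpus - 1 else start + step
        slices.append(batch[start:stop])
    return slices

def run_replicas(batch, n_gpus, replica_fn):
    """Apply replica_fn to every sub-batch, then concatenate the results."""
    outputs = [replica_fn(sub) for sub in split_batch(batch, n_gpus)]
    return np.concatenate(outputs, axis=0)

# Same shape as input_x in the question: a single sample.
input_x = np.zeros((1, 288, 1))
sub_shapes = [sub.shape for sub in split_batch(input_x, 8)]
print(sub_shapes)
```

With a single sample and 8 copies, seven of the sub-batches in this sketch have a leading dimension of 0, which is the kind of empty input a cuDNN kernel cannot describe.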
You are triggering an error because the input of the CuDNNLSTM layer is empty for at least one of the model copies. This is because the divide operation requires the following, where input is the number of samples in the batch passed to the model:
input // n_gpus > 0
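The same condition may also explain the training error described in the question's update. The sketch below assumes that fit walks through the NumPy data in consecutive batches of batch_size and that the final batch holds only the leftover samples; batch_sizes and smallest_sub_batch are hypothetical helpers written for this check.

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the consecutive batches, including a smaller final one."""
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes

def smallest_sub_batch(n_samples, batch_size, n_gpus):
    """Smallest value of batch // n_gpus over all batches."""
    return min(size // n_gpus for size in batch_sizes(n_samples, batch_size))

# Shapes from the question, with batch_size=64 and 8 GPUs.
original_training = smallest_sub_batch(1441, 64, 8)  # last batch has 33 samples
updated_training = smallest_sub_batch(8065, 64, 8)   # last batch has 1 sample
print(original_training, updated_training)
```

Under these assumptions, 1441 samples leave a final batch of 33, which still gives every copy some data, while 8065 samples leave a final batch of a single sample, so the condition fails on the last training step. Choosing a sample count or batch_size that avoids such a small leftover batch would sidestep it.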
Try this code out:
import numpy as np

# A batch of 8 samples gives each of the 8 model copies one sample.
input_x = np.random.randn(8, n_timesteps, n_features)
model.predict(input_x)
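When the real input has fewer samples than GPUs, one possible workaround is to pad it before predicting and discard the extra predictions afterwards. This is a sketch and not part of the original answer: pad_to_multiple is a hypothetical helper, and it assumes that repeating the last sample is acceptable as filler and that the batch size used by predict is itself a multiple of n_gpus.

```python
import numpy as np

def pad_to_multiple(x, n_gpus):
    """Repeat the last sample until len(x) is a multiple of n_gpus.

    Returns the padded array and the number of real samples, so the
    caller can trim the predictions back to the original length.
    """
    n_samples = x.shape[0]
    n_missing = (-n_samples) % n_gpus
    if n_missing == 0:
        return x, n_samples
    padding = np.repeat(x[-1:], n_missing, axis=0)
    return np.concatenate([x, padding], axis=0), n_samples

# Same shape as input_x in the question.
input_x = np.random.randn(1, 288, 1)
padded_x, n_real = pad_to_multiple(input_x, 8)
print(padded_x.shape, n_real)

# Usage with the multi-GPU model from the question (not run here):
# predictions = model.predict(padded_x)[:n_real]
```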