
Error on prediction running keras multi_gpu_model

I have an issue running a Keras model on a Google Cloud Platform instance. The model is the following:

import tensorflow as tf
from keras.models import Sequential
from keras.layers import CuDNNLSTM, LeakyReLU, RepeatVector, TimeDistributed, Dense
from keras.utils import multi_gpu_model

n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
verbose, epochs, batch_size = 1, 1, 64  # low number of epochs, just for testing purposes

# Build the template model on the CPU so that its weights live in host memory,
# as recommended for multi_gpu_model.
with tf.device('/cpu:0'):
    m = Sequential()
    m.add(CuDNNLSTM(20, input_shape=(n_timesteps, n_features)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(RepeatVector(n_outputs))
    m.add(CuDNNLSTM(20, return_sequences=True))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(20)))
    m.add(LeakyReLU(alpha=0.1))
    m.add(TimeDistributed(Dense(1)))

# `self` refers to the enclosing class instance; this code runs inside a method.
self.model = multi_gpu_model(m, gpus=8)
self.model.compile(loss='mse', optimizer='adam')

self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)

As you can see from the code above, I run the model on a machine with 8 GPUs (Nvidia Tesla K80). Training works well, without any errors. However, prediction fails and returns the following error:

W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336: Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'

Here is the code that runs the prediction:

self.model.predict(input_x)

What I've noticed is that if I remove the code for multi-GPU data parallelism, everything works well on a single GPU. To be more precise, if I comment out this line, the code runs without error:

self.model = multi_gpu_model(m, gpus=8)

What am I missing?

virtualenv information

cudatoolkit - 10.0.130
cudnn - 7.6.4
keras - 2.2.4
keras-applications - 1.0.8
keras-base - 2.2.4
keras-gpu - 2.2.4
python - 3.6

UPDATE

train_x.shape = (1441, 288, 1)
train_y.shape = (1441, 288, 1)
input_x.shape = (1, 288, 1)

After Olivier Dehaene's reply I tried his suggestion and it worked. I then tried to modify the input_x shape in order to obtain (8, 288, 1). To do that I also modified the train_x and train_y shapes. Here is a recap:

train_x.shape = (8065, 288, 1)
train_y.shape = (8065, 288, 1)
input_x.shape = (8, 288, 1)

But now I get the same error during the training phase, on this line:

self.model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
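
I suspect this is because of the final, smaller remainder batch (this is my own arithmetic, not something stated in the thread): 8065 % 64 == 1, so the last training batch contains a single sample, which again cannot be split across 8 GPUs. A quick sanity check:

n_samples, batch_size, n_gpus = 8065, 64, 8
remainder = n_samples % batch_size  # size of the final batch
print(remainder)                    # 1
print(remainder // n_gpus > 0)      # False -> at least one GPU receives an empty sub-batch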

Olivier Dehaene's answer:

From tf.keras.utils.multi_gpu_model we can see that it works in the following way:

  • Divide the model's input(s) into multiple sub-batches.
  • Apply a model copy on each sub-batch. Every model copy is executed on a dedicated GPU.
  • Concatenate the results (on CPU) into one big batch.

You are triggering an error because the input of the CuDNNLSTM layer is empty for at least one of the model copies. This is because the divide operation requires that input // n_gpus > 0.
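
Here is a minimal sketch of that split logic (a simplified re-implementation for illustration; the real keras.utils.multi_gpu_model slices tensors inside the graph, but the sizing arithmetic is the same):

def split_sizes(batch_size, n_gpus):
    # Each GPU gets batch_size // n_gpus samples; the last GPU also
    # takes whatever remainder is left over.
    base = batch_size // n_gpus
    return [base] * (n_gpus - 1) + [base + batch_size % n_gpus]

print(split_sizes(1, 8))  # [0, 0, 0, 0, 0, 0, 0, 1] -> seven copies get an empty input
print(split_sizes(8, 8))  # [1, 1, 1, 1, 1, 1, 1, 1] -> every copy gets one sample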

Try this code out:

import numpy as np

input_x = np.random.randn(8, n_timesteps, n_features)
model.predict(input_x)
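
Alternatively, if you really need single-sample predictions, one option (my suggestion, not something the API requires) is to predict with the template model m instead of the parallel wrapper: multi_gpu_model shares weights with the template model it wraps, so m already holds everything learned during fit:

# Sketch: predict on a single device with the template model `m`.
# multi_gpu_model shares weights between `m` and the parallel model,
# so no weight copying is needed after training.
input_x = np.random.randn(1, n_timesteps, n_features)
prediction = m.predict(input_x)  # no sub-batch split, so batch size 1 is fine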
