Why is Keras LSTM on CPU three times faster than GPU?

I use this notebook from Kaggle to run an LSTM neural network.

I started training the neural network and found that it was too slow: it was almost three times slower than training on the CPU.

  • CPU performance: 8 min per epoch;
  • GPU performance: 26 min per epoch.

After this I decided to look for an answer in this question on Stack Overflow, and I applied a CuDNNLSTM (which runs only on a GPU) instead of LSTM.

As a result, GPU performance improved to only 1 min per epoch, but the accuracy of the model decreased by 3%.
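
A minimal sketch of the kind of swap involved, assuming a standalone Keras 2.x Sequential model on the TensorFlow backend; the layer size, loss and the build_model helper here are illustrative, not taken from the notebook:

from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

def build_model(timesteps, features, use_cudnn=True):
    model = Sequential()
    if use_cudnn:
        # cuDNN kernel: GPU-only, fixed tanh/sigmoid activations, no recurrent dropout
        model.add(CuDNNLSTM(64, input_shape=(timesteps, features)))
    else:
        # Generic implementation: runs on CPU or GPU, fully configurable
        model.add(LSTM(64, input_shape=(timesteps, features)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model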

Questions:

1) Does somebody know why the GPU works slower than the CPU with the classic LSTM layer? I do not understand why this happens.

2) Why, when I use CuDNNLSTM instead of LSTM, does training become much faster while the accuracy of the model decreases?

PS:

My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)

My GPU: nVidia GeForce GTX 1050 Ti (4 GB)

Guessing it's just a different, better implementation and, if the implementation is different, you shouldn't expect identical results.

In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance versus a general implementation for GPUs. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs than a team working on a general CNN implementation would.

The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
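
For illustration, a minimal sketch of how the backend float type can be inspected and changed in Keras; whether the cuDNN path actually lowers precision in this particular case is an assumption, not something confirmed above:

from keras import backend as K

print(K.floatx())        # usually 'float32' by default
K.set_floatx('float16')  # smaller dtype for newly created layers: faster, less precise
# K.set_floatx('float64') would do the opposite: more precise, slower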

I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):

  1. Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
  2. Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.

Increasing the batch size increased loss and val loss, so you'll need to make a decision about the trade-offs you want to make.
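
A minimal sketch of that batch-size comparison, assuming an already compiled Keras model called model and training arrays X_train/y_train (names hypothetical); only batch_size differs between the two calls:

# Smaller batches: more weight updates per epoch, slower epochs,
# typically lower loss/val_loss at a given epoch count
model.fit(X_train, y_train, epochs=3, batch_size=500, validation_split=0.1)

# Larger batches: better GPU utilization, much faster epochs,
# but loss/val_loss per epoch tends to end up higher
model.fit(X_train, y_train, epochs=3, batch_size=4700, validation_split=0.1)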

In Keras, CuDNNLSTM is the fast LSTM implementation backed by cuDNN:

# Assumes X_train is a 3D array of shape (samples, timesteps, features); input_shape omits the batch dimension
model.add(CuDNNLSTM(units, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))

It can only be run on the GPU with the TensorFlow backend.
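
As a quick sanity check that the TensorFlow backend can actually see a GPU before using this layer (a sketch, assuming TensorFlow 1.x as used with standalone Keras):

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())                         # True if a usable CUDA GPU is found
print([d.name for d in device_lib.list_local_devices()])  # expect something like '/device:GPU:0'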
