TF.Keras model.predict is slower than straight Numpy?

Thanks, everyone, for trying to help me understand the issue below. I have updated the question and produced a CPU-only run and a GPU-only run. In either case, a direct numpy calculation is hundreds of times faster than model.predict(). Hopefully this clarifies that it does not appear to be a CPU vs GPU issue (if it is, I would love an explanation).


Let's create a trained model with keras.

import tensorflow as tf

(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

Now let's re-create the model.predict function using numpy.

import numpy as np

W = model.get_weights()

def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X
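
One caveat worth noting: exponentiating raw logits can overflow float32 (the GPU run below prints RuntimeWarning: overflow encountered in exp for exactly this reason). A small sketch of a numerically stable variant, which subtracts the per-row maximum before exponentiating; this changes nothing mathematically, since softmax is shift-invariant:

def softmax_stable(Z):
    # exp(z - max(z)) / sum(exp(z - max(z))) equals exp(z) / sum(exp(z)),
    # but the largest exponent is exactly 0, so np.exp cannot overflow.
    Z = Z - Z.max(1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(1, keepdims=True)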

We can easily verify these are the same function (modulo machine errors in the implementation).

print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))
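
For a stricter check than comparing argmax, the raw probabilities can be compared elementwise. A sketch, with a tolerance chosen loosely for float32 rounding (and assuming the logits don't overflow; otherwise swap in the stable softmax sketched above):

# True if every probability agrees to within float32 rounding noise.
print(np.allclose(model.predict(X[:100]), predict(X[:100]), atol=1e-4))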

We can also test how fast these functions run. Using ipython:

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

I get that predict runs about 10,000 times faster than model.predict at small batches, dropping to around 100 times faster at larger batches. Regardless, why is predict so much faster? In fact, predict isn't even optimized; we could use numba, or even re-write predict in straight C code and compile it.
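
For illustration only, a hedged sketch of what a numba version of one layer might look like (assuming numba is installed; the gain would likely be modest here, since the matmuls already run in optimized BLAS):

from numba import njit
import numpy as np

@njit(cache=True)
def dense_relu(X, W, b):
    # One Dense layer plus ReLU, compiled to machine code by numba.
    return np.maximum(X @ W + b, 0.0)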

Thinking in terms of deployment, why would manually extracting the weights from the model and re-writing the function be thousands of times faster than what keras does internally? It also means that a script that loads a .h5 file or similar may be much slower than a manually re-written prediction function. In general, is this true?
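
To make the deployment idea concrete, here is a small sketch (the file name is illustrative) of rebuilding the same numpy predictor from a saved model, so keras is only needed once to extract the weights:

# Save once, then serve inference with plain numpy.
model.save('mnist_dense.h5')                       # illustrative path

loaded = tf.keras.models.load_model('mnist_dense.h5')
W = loaded.get_weights()                           # plain numpy arrays
print(predict(X[:10]).argmax(1))                   # predict() as defined above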


Ipython Output (CPU):

Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"    
import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>

import numpy as np
W = model.get_weights()
def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Ipython Output (GPU):

Python 3.7.7 (default, Mar 26 2020, 15:48:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf 
   ...:  
   ...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data() 
   ...:  
   ...: model = tf.keras.models.Sequential([ 
   ...:     tf.keras.layers.Flatten(), 
   ...:     tf.keras.layers.Dense(1000,'relu'), 
   ...:     tf.keras.layers.Dense(100,'relu'), 
   ...:     tf.keras.layers.Dense(10,'softmax'), 
   ...: ]) 
   ...: model.compile('adam','sparse_categorical_crossentropy') 
   ...: model.fit(X,Y,epochs=20,batch_size=1024)                                                                                                                                                                   
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>

In [2]: import numpy as np 
   ...:  
   ...: W = model.get_weights() 
   ...:  
   ...: def predict(X): 
   ...:     X      = X.reshape((X.shape[0],-1))           #Flatten 
   ...:     X      = X @ W[0] + W[1]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[2] + W[3]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[4] + W[5]                      #Dense 
   ...:     X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax 
   ...:     return X 
   ...:                                                                                                                                                                                                            

In [3]: print(model.predict(X[:100]).argmax(1)) 
   ...: print(predict(X[:100]).argmax(1))                                                                                                                                                                          
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

In [4]: %timeit model.predict(X[:10]).argmax(1)                                                                                                                                                                    
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit predict(X[:10]).argmax(1)                                                                                                                                                                          
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We observe that the main cause is Eager Execution mode. Below, we take a brief look at your code and the corresponding results on both a CPU and a GPU basis. It is true that numpy doesn't operate on the GPU, so unlike tf-gpu it doesn't incur any data-transfer overhead.

But it is also quite noticeable how much faster your numpy-based predict is than model.predict in tf.keras, even though the input test set is only 10 samples. We're not giving any deep analysis here, though there are detailed write-ups on this you may love to read.


My setup is as follows. I'm using the Colab environment and checking with both CPU and GPU modes.

TensorFlow 1.15.2
Keras 2.3.1
Numpy 1.19.5

TensorFlow 2.4.1
Keras 2.4.0
Numpy 1.19.5

TF 1.15.2 - CPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)  # note: missing (), so this prints the function object, not the result
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
TensorFlow 1.x selected.
1.15.2
A:  <function is_built_with_cuda at 0x7f122d58dcb0>
B:  
([], ['/device:CPU:0'])

Now, running your code.

import tensorflow as tf
import keras
print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(1000,'relu'),
    keras.layers.Dense(100,'relu'),
    keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.07 ms per loop
1000 loops, best of 5: 1.48 ms per loop

We can see that the execution times are comparable with old keras. Now, let's test with the GPU as well.


TF 1.15.2 - GPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)  # missing (), as above
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
1.15.2
A:  <function is_built_with_cuda at 0x7f0b5ad46830>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.02 ms per loop
1000 loops, best of 5: 1.44 ms per loop

Now, the execution time here is also comparable with old keras and no eager mode. Let's now look at the new tf.keras, first with eager mode and then with eager mode disabled.


TF 2.4.1 - CPU

Eagerly

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)  # missing (), as above
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7fed85de3560>
B:  
([], ['/device:CPU:0'])

Now, running the code with eager mode:

import tensorflow as tf
import keras

print(tf.executing_eagerly())  # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([...])    # elided: same layers as above
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 28 ms per loop
1000 loops, best of 5: 1.73 ms per loop

Disable Eagerly

Now, if we disable eager mode and run the same code, we get:

import tensorflow as tf
import keras

# Disable eager execution
tf.compat.v1.disable_eager_execution()
# or, to disable eager execution of tf.functions only:
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([...])    # elided: same layers as above
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.37 ms per loop
1000 loops, best of 5: 1.57 ms per loop

Now we can see that, with eager mode disabled, the execution times in the new tf.keras are again comparable. Let's test with GPU mode as well.


TF 2.4.1 - GPU

Eagerly

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)  # missing (), as above
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7f16ad88f680>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([...])    # elided: same layers as above
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 26.3 ms per loop
1000 loops, best of 5: 1.48 ms per loop

Disable Eagerly

And lastly, if we disable eager mode and run the same code, we get:

# Disable eager execution
tf.compat.v1.disable_eager_execution()
# or, to disable eager execution of tf.functions only:
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False 

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([...])    # elided: same layers as above
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.12 ms per loop
1000 loops, best of 5: 1.45 ms per loop

And as before, with eager mode disabled, the execution times in the new tf.keras are comparable. That's why Eager mode is the root cause of tf.keras being slower than straight numpy.

Another answer is more useful in terms of "how to make tf.keras predict faster", but I think the following can help more with "what is it doing that takes so much time?". Even with eager mode disabled, you might be curious to see what execution looks like (e.g. with or without providing batch_size, etc.).

To answer this question you may find a tracing profiler useful. Tracing execution adds a lot of overhead (especially in places with a bunch of very lightweight python calls), but overall it should give you quite a bit of insight into which parts of the python code are being executed, because it simply logs exactly what is happening. You can try pytracing, since it produces files that the Chrome browser visualizes nicely on its built-in chrome://tracing page. To use it, e.g. in google colab, you can do the following:

First, install pytracing:

!pip install pytracing

Then generate a trace:

from pytracing import TraceProfiler
tp = TraceProfiler(output=open('/root/trace.out', 'wt'))
with tp.traced():
  for i in range(2): 
    model.predict(X[:1000], batch_size=1000)

Then download the trace:

from google.colab import files
files.download('/root/trace.out') 

After this, open the chrome://tracing page in the Chrome browser, click the "Load" button, and select the trace.out file you've downloaded.

You'll see something like the following: you can click on any element to see the full name of the python function, the file it comes from, and the wall time it took (again, all of this is higher than in normal runs due to the tracing overhead). [screenshot: chrome://tracing view of the generated profile]

You can see how disabling/enabling eager execution or changing the batch size changes the output, and you can see for yourself what takes the most time. From what I currently see (in non-eager mode, with a call like model.predict(X[:1000], batch_size=1000)), quite a bit of time is spent on:

Standardizing your data (whatever that means): ~2.5 ms (including tracing overhead):

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py:2336:_standardize_user_data

Preparing callbacks (even though we didn't set any): ~2 ms (including tracing overhead):

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py:133:configure_callbacks

As to the statement that the numpy version is not optimized, I wouldn't agree. The numpy implementation here is quite optimized: python makes no pure-python calls inside it (executing predict only results in calls to functions in C; I couldn't believe it at first, but it seems to be the case), so the overhead from Python is really minimal. You might gain a little by optimizing the way you do ReLU and eliminating the extra allocation/deallocation, but that could only result in a very minor perf improvement.
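
For completeness, a sketch of that minor optimization: in-place ReLU and softmax remove one temporary array per layer (expect only a small win, since the matmuls dominate):

def predict_inplace(X):
    X = X.reshape((X.shape[0], -1)).astype(np.float32)
    X = X @ W[0]; X += W[1]; np.maximum(X, 0, out=X)   # Dense + in-place ReLU
    X = X @ W[2]; X += W[3]; np.maximum(X, 0, out=X)   # Dense + in-place ReLU
    X = X @ W[4]; X += W[5]
    X -= X.max(1, keepdims=True)                       # stable softmax, in place
    np.exp(X, out=X)
    X /= X.sum(1, keepdims=True)
    return X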

As others have pointed out, the Tensorflow binary in question was compiled for GPU optimization: while GPUs are excellent at intensive number-crunching because of the extremely high number of computing cores they have, they're painfully slow when it comes to moving data back and forth.

When a model is executing on a graphics card, all the necessary data has to be burst over to the GPU; it has no access to the host system's RAM (nor does the host system have access to the video memory). Once the GPU is finished processing, any results have to be shipped back over to the host system.

All of this moving around of data takes a great deal of time; moreover, a Tensorflow binary compiled to execute with a GPU/CUDA does not, to my knowledge, include any of the standard optimizations for executing on a CPU (like using faster extended instruction sets, e.g. AVX, AVX2, etc.).

So you're comparing a highly CPU-optimized scientific library, which half the time can process data without even having to go back to RAM (using CPU registers and on-chip cache), against code that has to collect every last bit it's going to need before shipping all of that data to the graphics card and back. I'm also leaving out all of the data manipulation that goes on under the hood of Tensorflow; it works on its own data structures, after all.
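
A rough way to see the transfer cost in isolation (a sketch, assuming a TF 2.x GPU build is available): time just the host-to-device round trip of a small batch, with no compute at all:

import time
import numpy as np
import tensorflow as tf

x = np.random.rand(10, 784).astype('float32')

t0 = time.perf_counter()
for _ in range(1000):
    with tf.device('/GPU:0'):
        g = tf.constant(x)      # host -> device copy
    _ = g.numpy()               # device -> host copy; also forces a sync
print('round trip: %.1f us' % ((time.perf_counter() - t0) / 1000 * 1e6))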

Eager execution is also another layer of inefficiency, I suppose.

As for best practices in deploying Keras models, I'm of the opinion that it's like everything else in software: premature optimization is the root of all evil. If you don't need it to be fast and lean, then let it be slow, modular, reusable, and intuitive. But hey, if you need or want the efficiency, then more power to you. Keras is designed for rapid development and research, not production code.

In short, the answer is that it's for the same reason C++ is faster than Python (because the Python interpreter has so much more overhead, as does Tensorflow).

Instead of model.predict(input), try simply model(input).
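
A minimal sketch of this suggestion, assuming the tf, model, and X from the question are in scope. The cast is needed because a direct call skips model.predict's input massaging, and wrapping the call in tf.function adds the graph-mode speedup discussed in the other answers:

x = tf.convert_to_tensor(X[:10], dtype=tf.float32)  # direct calls don't auto-cast uint8

@tf.function                     # traced once, then runs as a graph
def fast_predict(batch):
    return model(batch, training=False)

print(fast_predict(x).numpy().argmax(1))            # matches model.predict(X[:10]).argmax(1)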
