TF.Keras model.predict is slower than straight Numpy?

Question

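Thanks, everyone, for trying to help me understand the issue below. I have updated the question and produced a CPU-only run and a GPU-only run. In general, it appears that in either case a direct numpy calculation is hundreds of times faster than model.predict(). Hopefully this clarifies that this does not appear to be a CPU vs GPU issue (if it is, I would love an explanation).

Let's create a trained model with keras.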

import tensorflow as tf

(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

Now let's re-create model.predict using numpy.

import numpy as np

W = model.get_weights()

def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X
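
One caveat about the softmax line: np.exp(X) can overflow for large logits, which is what the RuntimeWarning lines in the GPU session output further down report. A minimal sketch of a numerically stable variant follows. The names stable_softmax and predict_stable are chosen here for illustration, and W is assumed to be the six-element weight list returned by model.get_weights() for the model above.

```python
import numpy as np

def stable_softmax(logits):
    # Subtracting the row maximum leaves the softmax unchanged
    # mathematically but keeps np.exp from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def predict_stable(X, W):
    # W is assumed to be [kernel0, bias0, kernel1, bias1, kernel2, bias2].
    h = X.reshape((X.shape[0], -1)).astype(np.float32)  # Flatten
    h = np.maximum(h @ W[0] + W[1], 0)                  # Dense + ReLU
    h = np.maximum(h @ W[2] + W[3], 0)                  # Dense + ReLU
    return stable_softmax(h @ W[4] + W[5])              # Dense + Softmax
```

Called as predict_stable(X[:100], model.get_weights()), this should give the same argmax as the version above without the overflow warnings.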

We can easily verify these are the same function (modulo machine errors in implementation).

print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))

We can also test how fast these functions run. Using ipython:

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

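I get that predict runs about 10,000 times faster than model.predict at low batch sizes, dropping to around 100 times faster at larger batch sizes. Regardless, why is predict so much faster? In fact, predict isn't even optimized; we could use numba, or even rewrite predict directly in C code and compile it.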
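
Outside IPython, where %timeit is unavailable, the same kind of measurement can be made with the standard library's timeit module. The sketch below is a stand-in under stated assumptions: it uses random weights with the same shapes as the model above rather than trained ones, random uint8 images instead of MNIST, and time_per_call and numpy_forward are illustrative helper names. Passing model.predict in place of numpy_forward would time the Keras path the same way.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the shapes of the 784-1000-100-10 model (not trained weights).
W = [
    rng.normal(scale=0.01, size=(784, 1000)).astype(np.float32), np.zeros(1000, np.float32),
    rng.normal(scale=0.01, size=(1000, 100)).astype(np.float32), np.zeros(100, np.float32),
    rng.normal(scale=0.01, size=(100, 10)).astype(np.float32), np.zeros(10, np.float32),
]

def numpy_forward(x):
    h = x.reshape((x.shape[0], -1)).astype(np.float32)
    h = np.maximum(h @ W[0] + W[1], 0)
    h = np.maximum(h @ W[2] + W[3], 0)
    z = h @ W[4] + W[5]
    z -= z.max(axis=1, keepdims=True)  # keep np.exp from overflowing
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def time_per_call(fn, x, number=20, repeat=3):
    """Return the best observed seconds per call of fn(x)."""
    best = min(timeit.repeat(lambda: fn(x), number=number, repeat=repeat))
    return best / number

# Fake images with the MNIST shape and dtype.
X_fake = rng.integers(0, 256, size=(1000, 28, 28), dtype=np.uint8)

for batch in (1, 10, 100, 1000):
    per_call = time_per_call(numpy_forward, X_fake[:batch])
    print(f"batch={batch:5d}  {per_call * 1e6:10.1f} us/call  "
          f"{per_call / batch * 1e6:8.2f} us/sample")
```

Reporting both time per call and time per sample makes it easier to see how much of the cost is a fixed per-call overhead that shrinks in relative terms as the batch grows.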

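Thinking in terms of deployment, why would manually extracting the weights from the model and rewriting the function be thousands of times faster than what keras does internally? This also means that writing a script that uses a .h5 file or similar may be much slower than manually rewriting the prediction function. In general, is this true?

Ipython Output (CPU):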

Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"    
import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>

import numpy as np
W = model.get_weights()
def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Ipython Output (GPU):

Python 3.7.7 (default, Mar 26 2020, 15:48:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf 
   ...:  
   ...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data() 
   ...:  
   ...: model = tf.keras.models.Sequential([ 
   ...:     tf.keras.layers.Flatten(), 
   ...:     tf.keras.layers.Dense(1000,'relu'), 
   ...:     tf.keras.layers.Dense(100,'relu'), 
   ...:     tf.keras.layers.Dense(10,'softmax'), 
   ...: ]) 
   ...: model.compile('adam','sparse_categorical_crossentropy') 
   ...: model.fit(X,Y,epochs=20,batch_size=1024)                                                                                                                                                                   
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>

In [2]: import numpy as np 
   ...:  
   ...: W = model.get_weights() 
   ...:  
   ...: def predict(X): 
   ...:     X      = X.reshape((X.shape[0],-1))           #Flatten 
   ...:     X      = X @ W[0] + W[1]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[2] + W[3]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[4] + W[5]                      #Dense 
   ...:     X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax 
   ...:     return X 
   ...:                                                                                                                                                                                                            

In [3]: print(model.predict(X[:100]).argmax(1)) 
   ...: print(predict(X[:100]).argmax(1))                                                                                                                                                                          
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

In [4]: %timeit model.predict(X[:10]).argmax(1)                                                                                                                                                                    
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit predict(X[:10]).argmax(1)                                                                                                                                                                          
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 1

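We observe that the main cause of the issue is Eager Execution mode. We take a shallow look at your code and the corresponding results on both CPU and GPU. It is true that numpy doesn't run on the GPU, so unlike tf-gpu it doesn't incur any data-transfer overhead.

But it is also quite noticeable how much faster your np-based predict method is than model.predict from tf.keras, even though the input test set is only 10 samples. However, we are not giving any deep analysis here of the kind you might love to read in a more thorough write-up.

My setup is as follows. I'm using the Colab environment and checking both CPU and GPU mode.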

TensorFlow 1.15.2
Keras 2.3.1
Numpy 1.19.5

TensorFlow 2.4.1
Keras 2.4.0
Numpy 1.19.5

TF 1.15.2 - CPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

TensorFlow 1.x selected.
1.15.2
A:  <function is_built_with_cuda at 0x7f122d58dcb0>
B:  
([], ['/device:CPU:0'])

Now, running your code.

import tensorflow as tf
import keras
print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.07 ms per loop
1000 loops, best of 5: 1.48 ms per loop

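We can see that with old keras the execution times of model.predict and the numpy predict are comparable. Now, let's test with the GPU as well.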

TF 1.15.2 - GPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

1.15.2
A:  <function is_built_with_cuda at 0x7f0b5ad46830>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])

...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.02 ms per loop
1000 loops, best of 5: 1.44 ms per loop

Here too, with old keras and no eager mode, the execution times are comparable. Now let's look at the new tf.keras, first with eager mode and then without it.

TF 2.4.1 - CPU

Eager mode enabled

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

2.4.1
A:  <function is_built_with_cuda at 0x7fed85de3560>
B:  
([], ['/device:CPU:0'])

Now, running the code in eager mode.

import tensorflow as tf
import keras

print(tf.executing_eagerly())  # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 28 ms per loop
1000 loops, best of 5: 1.73 ms per loop

Eager mode disabled

Now, if we disable eager mode and run the same code as follows, we get:

import tensorflow as tf
import keras

# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or, 
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.37 ms per loop
1000 loops, best of 5: 1.57 ms per loop

Now we can see that, with eager mode disabled in the new tf.keras, the execution times are comparable. Let's test in GPU mode as well.

TF 2.4.1 - GPU

Eager mode enabled

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

2.4.1
A:  <function is_built_with_cuda at 0x7f16ad88f680>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])

import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 26.3 ms per loop
1000 loops, best of 5: 1.48 ms per loop

Eager mode disabled

Finally, if we disable eager mode and run the same code as follows, we get:

# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or, 
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False 

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.12 ms per loop
1000 loops, best of 5: 1.45 ms per loop

As before, the execution times are comparable when eager mode is disabled in the new tf.keras. That is why eager mode is the root cause of tf.keras being slower than straight numpy.
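
To make the comparison concrete, the timings reported in this answer can be put side by side. The numbers below are copied from the %timeit outputs above (in milliseconds per loop); only the ratio computation is added here, and the mode labels "graph" and "eager" are shorthand for eager mode disabled and enabled.

```python
# (TF version, device, mode) -> (model.predict ms, numpy predict ms),
# copied from the %timeit outputs reported in this answer.
timings_ms = {
    ("TF 1.15.2", "CPU", "graph"): (1.07, 1.48),
    ("TF 1.15.2", "GPU", "graph"): (1.02, 1.44),
    ("TF 2.4.1", "CPU", "eager"): (28.0, 1.73),
    ("TF 2.4.1", "CPU", "graph"): (1.37, 1.57),
    ("TF 2.4.1", "GPU", "eager"): (26.3, 1.48),
    ("TF 2.4.1", "GPU", "graph"): (1.12, 1.45),
}

# Ratio > 1 means model.predict was slower than the numpy predict.
ratios = {key: keras_ms / numpy_ms for key, (keras_ms, numpy_ms) in timings_ms.items()}

for (version, device, mode), ratio in ratios.items():
    print(f"{version:10s} {device:3s} {mode:5s}  model.predict / numpy = {ratio:5.2f}")
```

Only the two eager rows come out well above 1; in every non-eager run model.predict was at least as fast as the numpy version on these measurements.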

Answer 2

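The other answer is more useful for "how to make tf keras predict faster", but I think the following can help more with "what is it doing that takes so much time?". Even with eager mode disabled, you may be curious what the execution looks like (e.g. with or without providing batch_size, etc.).

To answer this question, you may find a tracing profiler useful. Tracing the execution adds a lot of overhead (especially where there is a bunch of very lightweight python calls), but all in all it should give you a fairly good understanding of which part of the python code is being executed, because it simply records exactly what is happening. You could try pytracing, since it generates files that the Chrome browser visualizes nicely on its built-in chrome://tracing page. To use it, for example in google colab, you can do the following:

First, install pytracing: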

!pip install pytracing

Then generate the trace:

from pytracing import TraceProfiler
tp = TraceProfiler(output=open('/root/trace.out', 'wt'))
with tp.traced():
  for i in range(2): 
    model.predict(X[:1000], batch_size=1000)

Then download the trace:

from google.colab import files
files.download('/root/trace.out')

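After that, open the chrome://tracing page in the Chrome browser, click the "Load" button, and select the trace.out file you just downloaded.

You will see something like the following. You can click on any element to see the full name of the python function, the file it comes from, and how much time it took (again, all of these are higher than in a normal run because of the tracing overhead):

You can see how disabling/enabling eager execution or changing the batch size changes the output, and see for yourself what takes the most time. From what I have seen so far (in non-eager mode, calling model.predict(X[:1000], batch_size=1000)), quite a lot of time is spent on:

Standardizing your data (whatever that means): ~2.5ms (including tracing overhead):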

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py:2336:_standardize_user_data

Preparing callbacks (even though we haven't set any callbacks): ~2ms (including tracing overhead):

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py:133:configure_callbacks
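
If installing pytracing is not an option, the standard library's cProfile gives a coarser, function-level view of where time goes. The sketch below profiles a stand-in numpy workload; forward and profile_call are hypothetical names, not part of Keras or of this answer. Replacing the profiled call with model.predict(X[:1000], batch_size=1000) would profile the Keras path, with the same caveat that profiling inflates the timings.

```python
import cProfile
import io
import pstats

import numpy as np

def forward(x, w):
    # Stand-in workload: one dense layer followed by ReLU.
    return np.maximum(x @ w, 0)

def profile_call(fn, *args, top=10):
    """Run fn(*args) under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn(*args)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)  # slowest call trees first
    return buf.getvalue()

rng = np.random.default_rng(0)
report = profile_call(forward, rng.normal(size=(1000, 784)), rng.normal(size=(784, 100)))
print(report)
```

Sorting by cumulative time puts entry points such as _standardize_user_data or configure_callbacks near the top when they dominate a call.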

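As for the claim that the numpy version isn't optimized, I don't agree. The numpy implementation here is quite optimized: python is not making any pure python calls in it (executing predict only results in calls to functions in C; I couldn't believe it at first, but it seems to be the case), so the overhead from Python is really small. You might gain a little by optimizing the way ReLU is done and eliminating the extra allocation/deallocation, but that would only yield a very minor performance improvement.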
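
The remark about ReLU and extra allocations can be illustrated as follows. This is a sketch of one possible micro-optimization, not something this answer measured: np.maximum with out= clamps in place instead of building a boolean mask, and the bias is added into the matmul result rather than into a new array. The name dense_relu_inplace is illustrative.

```python
import numpy as np

def dense_relu_inplace(x, w, b):
    h = x @ w                  # single allocation for the layer output
    h += b                     # bias added in place (broadcast over rows)
    np.maximum(h, 0, out=h)    # ReLU in place, no temporary boolean mask
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 784)).astype(np.float32)
w = rng.normal(size=(784, 1000)).astype(np.float32)
b = rng.normal(size=1000).astype(np.float32)

h = dense_relu_inplace(x, w, b)
print(h.shape, h.dtype, float(h.min()))
```

Consistent with the point above, the matrix multiplications dominate at these layer sizes, so any gain from this change is expected to be small.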

Answer 3

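As others have pointed out, the Tensorflow binary in question was compiled with optimizations for the GPU: while GPUs excel at intensive number crunching thanks to their extremely high number of compute cores, they are painfully slow at moving data back and forth.

When the model executes on the graphics card, all the necessary data must be transferred to the GPU in bursts; it cannot access the host system's RAM (nor can the host system access video memory). Once the GPU has finished processing, all the results must be shipped back to the host system.

All this moving of data takes a lot of time. Also, as far as I know, Tensorflow binaries compiled to execute with GPU/CUDA do not include any of the standard optimizations for executing on the CPU (such as the use of faster extended instruction sets like AVX, AVX2, etc.).

So you are comparing a highly CPU-optimized scientific library, which can crunch the data without even having to go back to RAM half the time (CPU registers and on-chip cache storage), with code that has to gather every last bit it needs before sending all of the data to the graphics card and back. I have also left out all the data manipulation that goes on under Tensorflow's hood. After all, it works with its own data structures.

Eager execution is also another layer of inefficiency, I guess.

As for best practices for deploying Keras models, I think it is like everything else in software: premature optimization is the root of all evil. If you do not need it to be fast and lean, then let it be slow, modular, reusable and intuitive. But hey, if you need or want efficiency, then more power to you. Keras was designed for rapid development and research, not production code.

In short, the answer is: for the same reason C++ is faster than Python (because the Python interpreter has much more overhead).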

Answer 4

Instead of model.predict(input), try simply model(input).
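
A minimal sketch of that suggestion, assuming TensorFlow 2.x with eager execution enabled. The model here is an untrained stand-in with the same layer sizes as in the question, and the input is random data rather than MNIST. Calling the model directly returns a tensor, so .numpy() is needed to get an array back. No timings are shown because they depend on the machine and the TensorFlow version.

```python
import numpy as np
import tensorflow as tf

# Untrained stand-in with the question's layer sizes (an assumption for illustration).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

x = np.random.default_rng(0).random((10, 28, 28), dtype=np.float32)

via_predict = model.predict(x, verbose=0)      # batched prediction loop
via_call = model(x, training=False).numpy()    # direct forward pass

print(np.allclose(via_predict, via_call, atol=1e-5))
```

The direct call skips the per-call setup that model.predict performs for its batching loop and callbacks, which is why it tends to suit small inputs that fit in memory.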

4 answers

Answer 1: 5, accepted, 2021-04-24 00:05:41
Answer 2: 3, 2021-04-25 15:17:10
Answer 3: 2, 2021-04-19 05:33:49
Answer 4: 0, 2022-12-27 15:45:12
