TensorFlow Eager mode: first execution on GPU is slow

Question

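The code below compares computation time on the CPU and on the GPU. On the first execution only, the GPU run is slower than the CPU run; on every subsequent run the GPU is faster. Why is the first run on the GPU slow, and how can I make the first GPU run fast?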

from __future__ import absolute_import, division, print_function

import tensorflow as tf

tf.enable_eager_execution()

import time

def time_matmul(x):
  start = time.time()
  for loop in range(10):
    tf.matmul(x, x)

  result = time.time()-start

  print("10 loops: {:0.2f}ms".format(1000*result))

print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
  with tf.device("GPU:0"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
    x = tf.random_uniform([1000, 1000])
    assert x.device.endswith("GPU:0")
    time_matmul(x)

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
  x = tf.random_uniform([1000, 1000])
  assert x.device.endswith("CPU:0")
  time_matmul(x)

Output on the first run:

On GPU:
10 loops: 443.04ms
On CPU:
10 loops: 100.01ms

Output on subsequent runs:

On GPU:
10 loops: 1.00ms
On CPU:
10 loops: 103.01ms

PS: This is different from a seemingly related question, because tf.device("GPU:0") already selects /device:GPU:0 rather than /device:XLA_GPU:0.

Answer 1

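Out of curiosity, I tried the OP's script three years later. The same thing happens with the latest versions of TF and CUDA (still on an old GTX 1050 card). One possible explanation is data movement.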

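On the first run, whether on the GPU or the CPU, data has to be moved into place before any work can start. Data movement is well known to slow things down significantly. CPU memory is physically "closer" than GPU memory, which usually sits on a separate board. Computation defaults to the CPU and its memory, so a CPU run needs little or no movement and stays essentially on the same chip. GPU memory is physically a different chip, "far away", so moving data there can take more time.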

This idea is supported by running the OP's script in a loop (slightly modified to match TF 2.9.1):

import tensorflow as tf

tf.compat.v1.enable_eager_execution()

import time

def time_matmul(run, x):
  start = time.time()
  for loop in range(10):
    tf.matmul(x, x)
  result = time.time()-start
  print(f"Run #{run}: {1000*result:0.2f}ms")

print("On GPU:")
# Force execution on GPU #0 if available.
# tf.test.is_gpu_available() is deprecated in TF 2.x; list the physical devices instead.
if tf.config.list_physical_devices("GPU"):
  with tf.device("GPU:0"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
    x = tf.random.uniform([1000, 1000])
    assert x.device.endswith("GPU:0")
    for run in range(10):
      time_matmul(run, x)

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
  x = tf.random.uniform([1000, 1000])
  assert x.device.endswith("CPU:0")
  for run in range(10):
    time_matmul(run, x)

The results are (the first ten lines are the GPU runs):

Run #0: 273.66ms
Run #1: 0.37ms
Run #2: 0.36ms
Run #3: 0.36ms
Run #4: 0.37ms
Run #5: 0.36ms
Run #6: 0.35ms
Run #7: 0.41ms
Run #8: 0.37ms
Run #9: 0.35ms
On CPU:
Run #0: 56.89ms
Run #1: 44.31ms
Run #2: 47.60ms
Run #3: 46.97ms
Run #4: 46.40ms
Run #5: 44.84ms
Run #6: 43.88ms
Run #7: 45.28ms
Run #8: 43.46ms
Run #9: 43.57ms

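Eyeballing what happens (a proper statistical approach would repeat this many times, but would add no further insight): the first run is slow, and the following runs are faster and, more importantly, stable. Stability is what we expect in the first place (the same operation should perform the same), but the first run has to set things up by putting the data in the "right" place in memory.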
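The "slow first run, stable afterwards" pattern can be measured a little more carefully by timing the first call on its own and summarizing the remaining calls. The following is only a minimal sketch: `benchmark` and `demo_tensorflow` are hypothetical helper names, not part of TensorFlow. The TensorFlow part assumes TF 2.x with a visible GPU, and it calls `.numpy()` on the result so that any queued GPU work has finished before the clock stops, a precaution the scripts above do not take.

```python
import statistics
import time


def benchmark(fn, repeats=10):
    """Call fn() `repeats` times and report the first call separately."""
    if repeats < 3:
        raise ValueError("need at least 3 repeats to summarize the steady state")
    timings_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings_ms.append(1000 * (time.perf_counter() - start))
    first, rest = timings_ms[0], timings_ms[1:]
    return {
        "first_ms": first,
        "steady_median_ms": statistics.median(rest),
        "steady_stdev_ms": statistics.stdev(rest),
    }


def demo_tensorflow(device="GPU:0"):
    """Hypothetical usage with TF 2.x; not called automatically."""
    import tensorflow as tf  # imported lazily so benchmark() has no TF dependency

    with tf.device(device):
        x = tf.random.uniform([1000, 1000])

        def ten_matmuls():
            for _ in range(10):
                y = tf.matmul(x, x)
            # Wait for the queued work; this also copies the result to the host,
            # so slightly more than the multiplications is being timed.
            y.numpy()

        return benchmark(ten_matmuls)
```

Calling `demo_tensorflow("GPU:0")` and `demo_tensorflow("CPU:0")` would return one dictionary per device, which makes the gap between the first call and the steady state easier to compare than a list of raw timings.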

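I am not aware of an API that places the data manually before the run starts. But that would be an "illusion" anyway. Here, Run #0 includes both movement and computation. Separating the two might make Run #0 as fast as all the other runs, but the data would still have to be moved beforehand, and the time that takes would simply not show up in the results table...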
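One way to approximate this separation in TF 2.x might be to create the tensor on the host, copy it to the GPU with `tf.identity` inside a `tf.device` scope, and time that step apart from the first and later multiplications. This is a sketch under assumptions, not a verified diagnosis: `stopwatch` and `demo_split_phases` are hypothetical names, a GPU is assumed to be visible, and the "placement" phase measures more than the copy, because TensorFlow also initializes the GPU the first time it is used.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Store the wall-clock duration of the with-block in results[label], in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = 1000 * (time.perf_counter() - start)


def demo_split_phases():
    """Hypothetical usage with TF 2.x and a visible GPU; not called automatically."""
    import tensorflow as tf  # imported lazily so stopwatch() has no TF dependency

    results = {}
    with tf.device("CPU:0"):
        x_host = tf.random.uniform([1000, 1000])

    with tf.device("GPU:0"):
        with stopwatch("placement_ms", results):
            x = tf.identity(x_host)  # copy to the GPU (first GPU use, so setup is included)
            x.numpy()                # wait for the copy; also copies back to the host
        with stopwatch("first_matmul_ms", results):
            tf.matmul(x, x).numpy()  # first multiplication, synchronized
        with stopwatch("later_matmul_ms", results):
            tf.matmul(x, x).numpy()  # same operation again, for comparison
    return results
```

If most of the one-off cost lands in `placement_ms` and `first_matmul_ms` while `later_matmul_ms` stays small, that would be consistent with a setup cost paid once, although it would still not say how much of it is data movement.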

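Note that this memory movement is a possible cause (this is abductive reasoning), and other things may be going on as well. The idea is supported by the script's results, but the most they allow us to conclude is that memory movement is a plausible cause. This post does not prove it. A proper analysis with a profiler to find the root cause would take more time (and a Python profiler may not be enough).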
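To illustrate why a Python profiler may fall short, the sketch below runs `cProfile` over an arbitrary callable: it reports time per Python-level call, so everything that happens inside a native library call shows up as a single opaque entry. The second function shows how a device-level trace could be captured with `tf.profiler.experimental.start` and `tf.profiler.experimental.stop` in TF 2.x. The names `profile_callable` and `trace_first_matmul` and the log directory are hypothetical, and viewing the trace is assumed to require TensorBoard with its profiler plugin.

```python
import cProfile
import io
import pstats
import time


def profile_callable(fn, top=5):
    """Run fn() under cProfile and return the report text, sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def trace_first_matmul(logdir="/tmp/tf_first_run_trace"):
    """Hypothetical usage with TF 2.x; not called automatically."""
    import tensorflow as tf  # imported lazily so profile_callable() has no TF dependency

    x = tf.random.uniform([1000, 1000])
    tf.profiler.experimental.start(logdir)  # begin collecting a trace
    try:
        tf.matmul(x, x).numpy()  # the operation of interest, synchronized
    finally:
        tf.profiler.experimental.stop()  # write the trace to logdir
```

For example, `print(profile_callable(lambda: time.sleep(0.01)))` lists the built-in `sleep` call with its total time but nothing about what happened inside it, which is the same limitation one would hit with a call into TensorFlow's native code.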

That disclaimer aside, it really does look like the cost of memory movement is what we are observing here.

Solution 1
0 2022-09-15 02:53:22
