简体   繁体   中英

Tensorflow Eager mode: First execution on GPU slow

The code below compares computation time on the CPU vs GPU. Only for the first execution, I get slower runtime on GPU than CPU, and in all subsequent runs the GPU is much faster. Why is the first run on GPU slow? How do I make even the first run on GPU fast?

from __future__ import absolute_import, division, print_function

import tensorflow as tf

tf.enable_eager_execution()

import time

def time_matmul(x):
  start = time.time()
  for loop in range(10):
    tf.matmul(x, x)

  result = time.time()-start

  print("10 loops: {:0.2f}ms".format(1000*result))

print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
  with tf.device("GPU:0"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
    x = tf.random_uniform([1000, 1000])
    assert x.device.endswith("GPU:0")
    time_matmul(x)

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
  x = tf.random_uniform([1000, 1000])
  assert x.device.endswith("CPU:0")
  time_matmul(x)

Output on first run:

On GPU:
10 loops: 443.04ms
On CPU:
10 loops: 100.01ms

Output on subsequent runs:

On GPU:
10 loops: 1.00ms
On CPU:
10 loops: 103.01ms

PS: This is different from a seemingly related question because tf.device("GPU:0") already chooses /device:GPU:0 and not /device:XLA_GPU:0

Out of curiosity I have tried the OP script 3 years later. The same situation happens on the latest version of TF, CUDA (yet an old GTX1050 card). A possible explanation is data movement.

On the first runs---either GPU or CPU---data moves around, ready for action. Data movements are well-known to slow things down significantly. CPU memory is physically "closer" than GPU memory, the latter usually on an external board. The default compute is the CPU and its memory, so a program is almost ready for CPU runs---little or nothing to move, and basically staying on the same chip. GPU memory is physically a different chip, "far away", so moving there is likely to take much more time.

This thinking can be supported by looping over the OP script (slight change to match TF2.9.1):

import tensorflow as tf

tf.compat.v1.enable_eager_execution()

import time

def time_matmul(run, x):
  start = time.time()
  for loop in range(10):
    tf.matmul(x, x)
  result = time.time()-start
  print(f"Run #{run}: {1000*result:0.2f}ms")

print("On GPU:")
# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
  with tf.device("GPU:0"): # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
    x = tf.random.uniform([1000, 1000])
    assert x.device.endswith("GPU:0")
    for run in range(10):
        time_matmul(run, x)

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
  x = tf.random.uniform([1000, 1000])
  assert x.device.endswith("CPU:0")
  for run in range(10):
      time_matmul(run, x)

Which results in:

Run #0: 273.66ms
Run #1: 0.37ms
Run #2: 0.36ms
Run #3: 0.36ms
Run #4: 0.37ms
Run #5: 0.36ms
Run #6: 0.35ms
Run #7: 0.41ms
Run #8: 0.37ms
Run #9: 0.35ms
On CPU:
Run #0: 56.89ms
Run #1: 44.31ms
Run #2: 47.60ms
Run #3: 46.97ms
Run #4: 46.40ms
Run #5: 44.84ms
Run #6: 43.88ms
Run #7: 45.28ms
Run #8: 43.46ms
Run #9: 43.57ms

Eye-balling what happens (a proper stats approach would run that many times, done but no more insight) the first run is slow, but then faster, and more importantly stable. Stability is what we expect in the first place (run the same should behave the same), but the first run needs to be setup by placing the data at the "right" place in memory.

I am not aware of APIs to place the data manually, and then start the runs. But this would be an "illusion". Run #0 here includes movement and computation. Splitting the two would likely make Run #0 as fast as all other runs, yet we still have to move the data beforehand---the required time would just not show in the result table...


Please note this memory movement is a likely cause (abductive reasoning here), and there might be something else happening. The thinking is supported by the script result, yet it just allows to conclude memory movement is a likely cause. This post proves nothing. A proper analysis to get the root cause would require more time with a profiler (and a Python profiler may not be enough).

Aside this disclaimer, it really looks like a memory movement cost we observe here.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM