
Performance measurement in TensorFlow's eager mode

In TensorFlow's guide about the performance of eager execution, there is a piece of code as follows:

import time
import tensorflow as tf  # not shown in the guide snippet; eager execution assumed enabled

def measure(x, steps):
  # TensorFlow initializes a GPU the first time it's used, exclude from timing.
  tf.matmul(x, x)
  start = time.time()
  for i in range(steps):
    x = tf.matmul(x, x)
    _ = x.numpy()  # Make sure to execute op and not just enqueue it
  end = time.time()
  return end - start
...
with tf.device("/cpu:0"):
  print("CPU: {} secs".format(measure(tf.random_normal(shape), steps)))

with tf.device("/gpu:0"):
  print("GPU: {} secs".format(measure(tf.random_normal(shape), steps)))

What is the meaning of the line marked with the second comment, _ = x.numpy()?
If I comment out this line, will tf.matmul(x, x) not be executed on the CPU/GPU?

Technically, the call to tf.matmul can return before the matrix multiplication is complete.

In practice:

  • If executing on the CPU (and not using execution_mode=tf.contrib.eager.ASYNC), then tf.matmul returns only after the matrix multiplication has completed.
  • If executing on the GPU, then tf.matmul returns after enqueueing the matrix multiplication on the CUDA stream (see the NVIDIA developer documentation for more information on streams).

The .numpy() call causes the result to be copied back to host memory (since numpy arrays must be backed by host and not GPU memory). To do that correctly, it has to wait for all compute operations enqueued on the CUDA stream to complete. Thus the .numpy() call is a means of ensuring "the CUDA stream has been processed". The intent is to ensure that end - start accounts for the time it takes the operation to complete, not just the time to enqueue it on the CUDA stream.
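To see that effect in isolation, here is a small illustrative sketch (not from the original post); it assumes the TF 1.x-era API used in the question, with eager execution enabled, a GPU present, and an arbitrary 1000x1000 matrix size:

import time
import tensorflow as tf

tf.enable_eager_execution()  # TF 1.x; eager execution is the default in TF 2.x

with tf.device("/gpu:0"):
  x = tf.random_normal((1000, 1000))  # matrix size is an arbitrary choice
  _ = tf.matmul(x, x).numpy()         # warm-up: triggers GPU initialization

  start = time.time()
  y = tf.matmul(x, x)                 # on the GPU, this can return after merely
  enqueue = time.time() - start       # enqueueing the kernel on the CUDA stream

  start = time.time()
  _ = y.numpy()                       # blocks until the stream has been processed,
  sync_copy = time.time() - start     # then copies the result to host memory

print("enqueue: {:.6f} secs, sync + copy: {:.6f} secs".format(enqueue, sync_copy))

The first interval only measures how long it takes to hand the work to the stream; the second interval absorbs the actual kernel execution plus the device-to-host copy.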

That said, that code snippet seems to over-estimate the time executed on the GPU, since it also includes the time to copy the result back to host memory after each step. That _ = x.numpy() line should be moved outside the for loop to get a more accurate measure (i.e., time to execute the matrix multiplication steps times, then wait for the CUDA stream to finish, and copy to host memory once), as sketched below. Ideally, we would be able to exclude the time it takes to copy back to host memory.
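A minimal sketch of that suggested change, identical to the function in the question except that the .numpy() call is hoisted out of the loop:

def measure(x, steps):
  # TensorFlow initializes a GPU the first time it's used, exclude from timing.
  tf.matmul(x, x)
  start = time.time()
  for i in range(steps):
    x = tf.matmul(x, x)
  # Synchronize with the CUDA stream and copy to host once, after all steps
  # have been enqueued, instead of after every iteration.
  _ = x.numpy()
  end = time.time()
  return end - start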

Hope that makes sense.
