
Tensorflow performance (versions 1 vs 2 and CPU vs GPU)

I'm new to Machine Learning and found myself spending a disproportionate amount of time setting up Tensorflow. I use Anaconda to manage the different versions as environments. I managed to install

  • Tensorflow-cpu_1.14.0

  • Tensorflow-gpu_1.14.0

  • Tensorflow-cpu_2.0.0-beta1.

I did not manage to set up Tensorflow-gpu_2.0.0-beta due to some issues with the CUDA drivers and I've given up on this for the moment.

My goal is to make sure the three versions specified above are working properly and using all available resources on my system. In particular, my questions are:

  1. How does one reliably measure the performance of an existing computer and Tensorflow setup?

  2. Is it normal that, for the example I use, the CPU-only versions are faster?

  3. How should I go about selecting and installing the optimal Tensorflow setup for my system?

I read many topics dealing with performance issues on Windows and comparisons between GPU and CPU runtimes, but none seemed to address the questions above.

I did not find a single established standard example to test performance with, so I built my own (probably a grave error). I tested all three environments (using the code below) on my home computer (Windows 10 Home, x64; Intel i7-8750 CPU @ 2.20 GHz, 2208 MHz, 6 cores, 12 logical processors; 16 GB RAM; GeForce RTX 2060). I also ran the test example from [1], which demonstrated that matrix multiplication is faster on the GPU. I assume there is some reason for the discrepancies that I'm missing. Please feel free to comment on basic errors in thinking that may be apparent from the code.
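For reference, here is a rough sketch of that kind of matrix-multiplication check (this is not my benchmark, which follows below). It assumes the TF 1.14 graph-mode API and explicit device placement; it will not run unmodified under TF 2.x eager execution, and the matrix size is an arbitrary choice:

# rough CPU vs GPU matrix multiplication timing (TF 1.14 graph mode only)
import tensorflow as tf
from timeit import default_timer as timer

def matmul_time(device_name, size=4000):  # 4000x4000 is an arbitrary choice
    tf.compat.v1.reset_default_graph()
    with tf.device(device_name):
        a = tf.random.uniform((size, size))
        b = tf.random.uniform((size, size))
        c = tf.matmul(a, b)
    with tf.compat.v1.Session() as sess:
        start = timer()
        sess.run(c)  # time a single multiplication on the requested device
        return timer() - start

print("CPU:", matmul_time("/device:CPU:0"))
print("GPU:", matmul_time("/device:GPU:0"))  # only on the GPU build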

# python 3.6
import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from timeit import default_timer as timer

# small fully connected classifier trained on random data
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(optimizer=tf.compat.v1.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])


def random_one_hot_labels(shape):
    n, n_class = shape
    classes = np.random.randint(0, n_class, n)
    tmp_labels = np.zeros((n, n_class))
    tmp_labels[np.arange(n), classes] = 1
    return tmp_labels


# random inputs (1000 samples, 32 features) and random one-hot labels (10 classes)
data = np.random.random((1000, 32))
labels = random_one_hot_labels((1000, 10))

durations = []
for i in range(10):  # run N times
    start = timer()
    model.fit(data, labels, epochs=500, batch_size=32)
    durations.append(timer() - start)

print(f"tf.version.VERSION = {tf.version.VERSION}")
print(f"tf.keras.__version__ = {tf.keras.__version__}")
devices = device_lib.list_local_devices()  # this may allocate all GPU memory ?!
print(f"devices = {[x.name for x in devices]}")
print(f"model.fit durations: {durations}")

The CPU-only versions both outperform the GPU version. Additionally, there is a large difference between the Tensorflow versions. Below are the outputs of my code in the three different environments:

tf.version.VERSION = 1.14.0
tf.keras.__version__ = 2.2.4-tf
2019-08-26 13:41:15.980626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.2
pciBusID: 0000:01:00.0
2019-08-26 13:41:15.986261: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-08-26 13:41:15.990784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-26 13:41:15.993919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-26 13:41:15.997211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-08-26 13:41:16.000263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-08-26 13:41:16.002807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 4616 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
devices = ['/device:CPU:0', '/device:GPU:0']
model.fit durations: [34.81, 33.84, 34.37, 34.21, 34.54, 34.18, 35.09, 33.24, 33.32, 33.54]

-----------------------------------------------------------

tf.version.VERSION = 1.14.0
tf.keras.__version__ = 2.2.4-tf
devices = ['/device:CPU:0']
model.fit durations: [23.48, 23.43, 23.25, 23.71, 23.54, 24.017, 23.43, 24.08, 23.67, 23.94]

-----------------------------------------------------------

tf.version.VERSION = 2.0.0-beta1
tf.keras.__version__ = 2.2.4-tf
devices = ['/device:CPU:0']
model.fit durations: [15.53, 14.87, 14.65, 14.73, 14.68, 14.67, 15.11, 14.71, 15.54, 14.38]

I have been using Tensorflow for a while now, so I will try to answer your questions.

  1. One good way to measure your performance is with Tensorboard. It is installed automatically when you install Tensorflow. When training your model, indicate in your code where you want to save your checkpoints, for example in a folder called "trainings". You want to end up with a folder tree like trainings/training_1/my_model.ckpt. From a terminal, call Tensorboard like this: tensorboard --logdir=trainings. Tensorboard looks through the folder recursively, so if you have one folder per training, it will show each training separately without you having to run one Tensorboard instance per training. Tensorboard has graphs that show multiple things such as training accuracy, time spent computing, the learning rate, etc. As you can see in the image below, I am able to tell that training #1 was faster than #2 by about 15 min. If you do not know how to save checkpoints, you can look at this link; a minimal tf.keras callback sketch is also included after this answer.

  2. Looking at your GPU's compute capability, it should give a better fit duration than the CPU. Which versions of CUDA and cuDNN are you using?

  3. Unfortunately, it depends on what you are doing. It is usually better to use the latest release, but it can have bugs that weren't in earlier versions. I'd go with what you're already doing and create a virtual environment for each version you want to use. Keep in mind that if you export a frozen inference graph, it can only be used for inference with the same version of Tensorflow you were using when exporting. So if I export my graph with Tensorflow 1.14, I won't be able to run inference on it with Tensorflow 1.13.

[Image: Tensorboard graphs comparing the durations of two training runs]
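A minimal sketch of saving checkpoints and Tensorboard logs with tf.keras, reusing the model, data and labels from your code (the trainings/training_1 path just mirrors the folder layout described above; adjust as needed):

import tensorflow as tf

callbacks = [
    # save weights as checkpoints under trainings/training_1/
    tf.keras.callbacks.ModelCheckpoint(filepath="trainings/training_1/my_model.ckpt",
                                       save_weights_only=True),
    # write Tensorboard event files to the same folder
    tf.keras.callbacks.TensorBoard(log_dir="trainings/training_1"),
]
model.fit(data, labels, epochs=500, batch_size=32, callbacks=callbacks)
# then, from a terminal:  tensorboard --logdir=trainings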

I tested the same code with different network sizes. It turns out that with larger networks the GPU version performs much better than the CPU-only version. I suspect the results above are due to the overhead of loading data into GPU memory, which dominates for such a small model.

If you want to test this, use e.g. 1024 nodes per layer in the above code (and reduce the number of epochs), as sketched below.
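For example (a sketch only; 1024 units and 50 epochs are arbitrary choices, not a standard benchmark), reusing data and labels from the code above:

# same benchmark, but with wider layers so the GPU has enough work per step
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation='relu'),
    tf.keras.layers.Dense(1024, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.compat.v1.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels, epochs=50, batch_size=32)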
