I'm testing out my new NVIDIA Titan V, which supports float16 operations. I noticed that during training, float16 is much slower (~800 ms/step) than float32 (~500 ms/step).
To do float16 operations, I changed my keras.json file to:
{
    "backend": "tensorflow",
    "floatx": "float16",
    "image_data_format": "channels_last",
    "epsilon": 1e-07
}
Why are the float16 operations so much slower? Do I need to make modifications to my code and not just the keras.json file?
I am using CUDA 9.0, cuDNN 7.0, TensorFlow 1.7.0, and Keras 2.1.5 on Windows 10. My Python 3.5 code is below:
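import keras
from keras.models import Sequential
from keras.layers import (Activation, AveragePooling2D, Conv2D, Dense,
                          Flatten, MaxPooling2D)
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 336, 224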
train_data_dir = 'C:\\my_dir\\train'
test_data_dir = 'C:\\my_dir\\test'
batch_size=128
datagen = ImageDataGenerator(rescale=1./255,
                             horizontal_flip=True,  # randomly flip the images
                             vertical_flip=True)
train_generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')
test_generator = datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')
# Architecture of NN
model = Sequential()
model.add(Conv2D(32,(3, 3), input_shape=(img_height, img_width, 3),padding='same',kernel_initializer='lecun_normal'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32,(3, 3),padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64,(3, 3),padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64,(3, 3),padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(AveragePooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('sigmoid'))
my_rmsprop = keras.optimizers.RMSprop(lr=0.0001, rho=0.9, epsilon=1e-04, decay=0.0)
model.compile(loss='binary_crossentropy',
              optimizer=my_rmsprop,
              metrics=['accuracy'])
# Training
nb_epoch = 32
nb_train_samples = 512
nb_test_samples = 512
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,  # integer number of batches
    epochs=nb_epoch,
    verbose=1,
    validation_data=test_generator,
    validation_steps=nb_test_samples // batch_size)
# Evaluating on the testing set
# The second argument is a number of batches (steps), not a number of samples
model.evaluate_generator(test_generator, steps=nb_test_samples // batch_size)
The cuDNN documentation (section 2.7, subsection "Type Conversion") states:
Note: Accumulators are 32-bit integers which wrap on overflow.
This applies to the standard INT8 data type for the data input, the filter input, and the output.
Under those assumptions, @jiandercy is right: there is a float16-to-float32 conversion, followed by a conversion back before the result is returned, so float16 would be slower.
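To make the "convert, accumulate, convert back" pattern concrete, here is a minimal CPU-only sketch in plain Python. It does not call cuDNN or TensorFlow, and it is not a description of how cuDNN is implemented. The helper names (`to_float16`, `sum_with_half_accumulator`, `sum_with_wide_accumulator`) are made up for illustration, a 64-bit Python float stands in for the wider accumulator, and the `'e'` half-precision format of `struct` needs Python 3.6 or later. It only illustrates why a library might keep a wider accumulator and pay for conversions at the boundaries.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # Packing with the 'e' format stores x as a 16-bit float; unpacking
    # returns that rounded value as a regular Python float.
    return struct.unpack('<e', struct.pack('<e', x))[0]


def sum_with_half_accumulator(values):
    """Sum float16 inputs, rounding the running total to float16 each step."""
    total = 0.0
    for v in values:
        total = to_float16(total + to_float16(v))
    return total


def sum_with_wide_accumulator(values):
    """Sum float16 inputs in a wider accumulator, converting back once."""
    total = 0.0  # 64-bit Python float; stands in for a float32 accumulator
    for v in values:
        total += to_float16(v)
    return to_float16(total)


values = [0.1] * 10000
half_sum = sum_with_half_accumulator(values)
wide_sum = sum_with_wide_accumulator(values)
print(half_sum, wide_sum)
```

With these inputs the half-precision running total stops growing once each increment is smaller than half the spacing between adjacent float16 values, so it ends well short of the true sum, while the wide accumulator stays close to 1000. The extra conversions are the price of that accuracy.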
I updated to CUDA 10.0, cuDNN 7.4.1, TensorFlow 1.13.1, Keras 2.2.4, and Python 3.7.3. Using the same code as in the OP, training was marginally faster with float16 than with float32.
I expect that a more complex network architecture would show a bigger performance difference, but I didn't test this.
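For anyone who wants to repeat the comparison, a small timing helper along these lines may be useful. This is only a sketch, not the measurement method used above: `ms_per_step` and `dummy_step` are hypothetical names, and the dummy workload is a placeholder so the snippet runs without a GPU. In practice `step_fn` could wrap something like a single `model.train_on_batch(x, y)` call on a fixed batch, run once with each `floatx` setting.

```python
import time


def ms_per_step(step_fn, steps=20, warmup=3):
    """Return the mean wall-clock time of step_fn() in milliseconds."""
    # Warm-up calls are excluded so one-time costs (graph building, kernel
    # selection, memory allocation) do not skew the average.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / steps


# Stand-in workload so the sketch runs anywhere; replace with a real step.
def dummy_step():
    sum(i * i for i in range(10000))


dummy_ms = ms_per_step(dummy_step, steps=5, warmup=1)
print("%.3f ms/step" % dummy_ms)
```

Discarding the first few steps matters here because the first iterations of a training run typically include setup work that is unrelated to float16 versus float32 throughput.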