
Float16 slower than float32 in keras

I'm testing out my new NVIDIA Titan V, which supports float16 operations. I noticed that during training, float16 is much slower (~800 ms/step) than float32 (~500 ms/step).

To do float16 operations, I changed my keras.json file to:

{
"backend": "tensorflow",
"floatx": "float16",
"image_data_format": "channels_last",
"epsilon": 1e-07
}
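For reference, the same setting can also be made per script through the Keras backend API instead of editing keras.json globally; a minimal sketch:

from keras import backend as K

# Equivalent per-script setting; must run before any layers are built.
K.set_floatx('float16')
K.set_epsilon(1e-4)  # a larger epsilon is commonly used with float16 for numerical stability
print(K.floatx())    # 'float16'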

Why are the float16 operations so much slower? Do I need to make modifications to my code and not just the keras.json file?

I am using CUDA 9.0, cuDNN 7.0, tensorflow 1.7.0, and keras 2.1.5 on Windows 10. My python 3.5 code is below:

import keras
from keras.models import Sequential
from keras.layers import Activation, AveragePooling2D, Conv2D, Dense, Flatten, MaxPooling2D
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 336, 224

train_data_dir = 'C:\\my_dir\\train'
test_data_dir = 'C:\\my_dir\\test'
batch_size=128

datagen = ImageDataGenerator(rescale=1./255,
    horizontal_flip=True,   # randomly flip the images 
    vertical_flip=True) 

train_generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

test_generator = datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

# Architecture of NN
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(img_height, img_width, 3), padding='same', kernel_initializer='lecun_normal'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('sigmoid'))

my_rmsprop = keras.optimizers.RMSprop(lr=0.0001, rho=0.9, epsilon=1e-04, decay=0.0)
model.compile(loss='binary_crossentropy',
          optimizer=my_rmsprop,
          metrics=['accuracy'])

# Training 
nb_epoch = 32
nb_train_samples = 512
nb_test_samples = 512

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,   # integer number of batches per epoch
    epochs=nb_epoch,
    verbose=1,
    validation_data=test_generator,
    validation_steps=nb_test_samples // batch_size)

# Evaluating on the testing set (the second argument of evaluate_generator is steps, i.e. batches)
model.evaluate_generator(test_generator, nb_test_samples // batch_size)
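A quick sanity check (a hypothetical addition, not in my original script) confirms the weights really were created in float16:

import keras.backend as K

print(K.floatx())                        # expect 'float16'
print(model.layers[0].weights[0].dtype)  # conv kernel dtype, expect a float16 variant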

From the documentation of cuDNN (section 2.7, subsection Type Conversion) you can see:

Note: Accumulators are 32-bit integers which wrap on overflow.

and that this holds for the standard INT8 data type of the following: the data input, the filter input, and the output.

Under those assumptions, @jiandercy is right that there's a float16 to float32 conversion and then a back-conversion before returning the result, and float16 would be slower.
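A minimal sketch (a hypothetical benchmark, not part of the original answer) that times a single convolution in both dtypes under TensorFlow 1.x can make this conversion overhead visible:

import time

import numpy as np
import tensorflow as tf

def time_conv(dtype, n_iter=50):
    """Average wall-clock time for one conv2d forward pass in `dtype`."""
    tf.reset_default_graph()
    x = tf.placeholder(dtype, shape=(32, 224, 336, 3))
    w = tf.Variable(tf.random_normal((3, 3, 3, 32), dtype=dtype))
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
    data = np.random.rand(32, 224, 336, 3).astype(dtype.as_numpy_dtype)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(y, feed_dict={x: data})  # warm-up run, excluded from timing
        start = time.time()
        for _ in range(n_iter):
            sess.run(y, feed_dict={x: data})
        return (time.time() - start) / n_iter

print('float32: %.4f s/step' % time_conv(tf.float32))
print('float16: %.4f s/step' % time_conv(tf.float16))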

I updated to CUDA 10.0, cuDNN 7.4.1, tensorflow 1.13.1, keras 2.2.4, and python 3.7.3. Using the same code as in the OP, training time was marginally faster with float16 than with float32.

I fully expect that a more complex network architecture would show a bigger difference in performance, but I didn't test this.
