Periodic overhead when using tensorflow dataset for model training on GPU

As you can see in the following code, I am trying to train a simple model in TensorFlow with a TensorFlow Dataset. The dataset is pretty huge, and I shuffle, repeat and batch it in order to do stochastic gradient descent to train my model.

But I can observe a periodic overhead in the optimisation step (sess.run(train) in my code).

As you can see here, every 5 steps it needs 3 s instead of 0.5 s to do the optimisation.

Step 105 duration : 3.5233473777770996
Step 106 duration : 0.5653283596038818
Step 107 duration : 0.5391891002655029
Step 108 duration : 0.5480048656463623
Step 109 duration : 0.0415492057800293
Step 110 duration : 3.032115936279297
Step 111 duration : 0.5407207012176514
Step 112 duration : 0.5276811122894287
Step 113 duration : 0.5448746681213379
Step 114 duration : 0.04253268241882324
Step 115 duration : 3.1273345947265625

Moreover, my GPU is at 0% utilisation almost all the time, with around 90% of its memory in use.

This overhead seems to appear when the iterator finishes going through the whole dataset.

I am using Python 3.6 with Tensorflow 1.4 on Ubuntu 16.04.

Do you have any idea how I can speed up my training?

Best,

import tensorflow as tf
import numpy as np
import os, time, multiprocessing
import matplotlib.pyplot as plt

def _floats_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value.reshape(-1)))


def parser(record):
    num_features = 2000
    size_group = 300
    num_classes= 10
    class_indice = 0
    keys_to_features={
                'X': tf.FixedLenFeature([size_group*num_features],tf.float32),
                'label' : tf.FixedLenFeature([num_classes],tf.float32)}
    parsed = tf.parse_single_example(record, keys_to_features)

    label = parsed['label']
    label = tf.slice(label,[class_indice],[1])
    label = tf.squeeze(label) # To get a vector one dimension
    X = parsed['X']
    X= tf.reshape(X, [size_group,num_features])
    return X, label


def test_train_w_dataset():

    # Definition of the size 
    num_features = 2000
    num_ex = 2000
    size_group = 300
    num_classes = 10
    batch_size= 480
    max_iters = 300
    buffer_size = 10000

    # Creation of the dataset
    filename_tfrecords = 'tmp.tfrecords'
    if not os.path.isfile(filename_tfrecords):  # If the file doesn't exist, create it
        print("Start creating the Dataset")
        writer = tf.python_io.TFRecordWriter(filename_tfrecords)

        for i in range(num_ex):
            if i % 1000 == 0: print("Step :", i)
            X = np.random.normal(size=(size_group, num_features))
            vectors = 2*np.random.randint(0, 2, (num_classes, 1)) - 1
            features = tf.train.Features(feature={
                        'X': _floats_feature(X),
                        'label': _floats_feature(vectors)})
            example = tf.train.Example(features=features)
            writer.write(example.SerializeToString())
        writer.close()
    else:
        print("The dataset tfrecords already exists")

    # Input pipeline: parse, shuffle, batch, repeat and prefetch
    train_dataset = tf.data.TFRecordDataset(filename_tfrecords)
    num_proc = multiprocessing.cpu_count()
    train_dataset = train_dataset.map(parser,
                                      num_parallel_calls=num_proc)
    dataset_shuffle = train_dataset.shuffle(buffer_size=buffer_size,
                                            reshuffle_each_iteration=True)
    dataset_shuffle = dataset_shuffle.batch(batch_size)
    dataset_shuffle = dataset_shuffle.repeat()
    dataset_shuffle = dataset_shuffle.prefetch(batch_size)
    shuffle_iterator = dataset_shuffle.make_initializable_iterator()
    X_, y_ = shuffle_iterator.get_next()

    # Simple model and loss
    W = tf.Variable(tf.random_normal([num_features], stddev=1.), name="weights")
    W = tf.reshape(W, (1, 1, num_features))
    Prod = tf.reduce_sum(tf.multiply(W, X_), axis=2)
    Max = tf.reduce_max(Prod, axis=1)
    Tan = tf.reduce_sum(tf.multiply(tf.tanh(Max), y_))
    loss = tf.add(Tan, tf.reduce_sum(tf.multiply(W, W)))

    LR = 0.01
    restarts = 1
    optimizer = tf.train.GradientDescentOptimizer(LR)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    train = optimizer.minimize(loss)
    print("The graph is defined")
    sess = tf.Session(config=config)

    durationTab = []

    for essai in range(restarts+1):
        # Variables and iterator need to be reinitialized for each restart
        t0 = time.time()
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        sess.run(shuffle_iterator.initializer)
        t1 = time.time()
        duration = t1 - t0
        print('Duration of initialization : ', duration)
        for step in range(max_iters):
            t0 = time.time()
            sess.run(train)
            t1 = time.time()
            duration = t1 - t0
            print("Step ", str(step), ' duration : ', duration)
            durationTab += [duration]

    plt.plot(durationTab)
    plt.ylabel('Duration')
    plt.xlabel('Iteration')
    plt.show()

if __name__ == '__main__':

    test_train_w_dataset()

For GPU utilization, make sure you use the GPU-optimized binary. Check operation placement (in TensorBoard, for example). Force placement of the operations on the GPU (see tf.device).
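As a minimal sketch (not from the original answer; TF 1.x style to match the question), this is roughly what pinning ops with tf.device and logging placement looks like. The tensors here are made up for the example:

import tensorflow as tf

with tf.device('/gpu:0'):  # force this part of the graph onto the first GPU
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)

# log_device_placement prints, at session creation, which device each op was assigned to
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(c)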

For the periodic spikes there could be a few reasons:

  • Other processes block access to CPU/GPU/RAM/disk and you need to wait for that to pass. You can try to kill other superfluous tasks that might be running on your system.
  • You run out of RAM. Check how much of the swap space is used. If it's growing while you run, then the spikes might just be system thrashing, though it looks too well behaved for that.
  • Disk access. You mentioned that this seems to correlate with looping over the data. It could be that the system just needs to read the data again, so you have to wait for the disk, though usually that's not visible. You could speed it up by making sure the data is contiguous on the hard drive, or by moving it to an SSD or RAM.

Since a lot of the reasons have to do with RAM, you should probably try a smaller model (smaller batches, fewer layers, fewer nodes per layer) and see if the problem goes away. If it does, then you need to go out and buy more RAM.

It seems that adding dataset_shuffle = dataset_shuffle.cache() between the batch and repeat calls removes that periodic overhead. Nevertheless, I am not sure that the dataset is fully read when this command is used.
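For reference, a sketch of the question's input pipeline with the cache() call inserted at the place described (same variable names as in the question); one caveat, as far as I understand tf.data, is that caching after shuffle replays the first epoch's shuffled order on later epochs:

train_dataset = tf.data.TFRecordDataset(filename_tfrecords)
train_dataset = train_dataset.map(parser, num_parallel_calls=num_proc)
dataset_shuffle = train_dataset.shuffle(buffer_size=buffer_size,
                                        reshuffle_each_iteration=True)
dataset_shuffle = dataset_shuffle.batch(batch_size)
dataset_shuffle = dataset_shuffle.cache()      # keep the parsed, batched examples in memory after the first pass
dataset_shuffle = dataset_shuffle.repeat()
dataset_shuffle = dataset_shuffle.prefetch(batch_size)  # after batch(), the argument counts batches, not single examples
shuffle_iterator = dataset_shuffle.make_initializable_iterator()
X_, y_ = shuffle_iterator.get_next()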
