
How to improve the performance of this data pipeline for my tensorflow model

I have a tensorflow model which I am training on google-colab. The actual model is more complex, but I condensed it into a reproducible example (removed saving/restoring, learning rate decay, asserts, tensorboard events, gradient clipping and so on). The model works reasonably (converges to an acceptable loss) and I am looking for a way to speed up the training (iterations per second).

Currently on colab's GPU it takes 10 minutes to train for 1000 iterations. With my current batch size of 512 this means that the model processes ~850 examples per second (512 examples × 1000 iterations / 600 seconds ≈ 850). I would prefer to keep a batch size of 512 unless other sizes provide a reasonable speedup; by itself, changing the batch size does not change the speed.


So currently I have data stored in tfrecord format: here is a 500Mb example file, and the total data size is ~0.5Tb. This data passes through a reasonably heavy preprocessing step (I can't do the preprocessing beforehand, as it would increase the size of my tfrecords way above what I can afford). Preprocessing is done via tf.data, and the output tensors ((batch_size, 8, 8, 24), which is treated as NHWC, and (batch_size, 10)) are passed into a model. The example colab does not contain my actual model, only a simplified one which serves just as an example.


I tried a few approaches to speed up the training:

  • manual device placement (data pre-processing on the CPU, propagations on the GPU), but all my attempts resulted in worse speed (a 10% to 50% increase in training time).
  • improve data preprocessing. I reviewed the tf.data video and the data tutorials. I tried almost every technique from those tutorials and got no improvement (a 0% to 15% decrease in speed). In particular I tried (see the sketch after this list for how these were combined):
    • dataset.prefetch(...)
    • passing num_parallel_calls to map
    • combining map and batch in tf.contrib.data.map_and_batch
    • using parallel_interleave
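
For concreteness, here is a sketch of how these techniques can be combined in this pipeline (a reconstruction for illustration, not the exact code I ran; cycle_length=4 and num_parallel_calls=4 are placeholder values, and _parser and BATCH_SIZE are defined in the snippet below):

files = tf.data.Dataset.from_tensor_slices(files_train)
ds = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=4))        # read several files in parallel
ds = ds.shuffle(buffer_size=10000)
ds = ds.apply(tf.contrib.data.map_and_batch(         # fused map + batch
    _parser, BATCH_SIZE, num_parallel_calls=4, drop_remainder=True))
ds = ds.repeat()
ds = ds.prefetch(1)                                  # overlap preprocessing and training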

The code related to data preprocessing is here (here is a full reproducible example with example data):

_keys_to_map = {
    'd': tf.FixedLenFeature([], tf.string),  # data
    's': tf.FixedLenFeature([], tf.int64),   # score
}


def _parser(record):
    parsed = tf.parse_single_example(record, _keys_to_map)
    return parsed['d'], parsed['s']


def init_tfrecord_dataset():
  files_train = glob.glob(DIR_TFRECORDS + '*.tfrecord')
  random.shuffle(files_train)

  with tf.name_scope('tfr_iterator'):
    ds = tf.data.TFRecordDataset(files_train)      # define data from randomly ordered files
    ds = ds.shuffle(buffer_size=10000)             # select elements randomly from the buffer
    ds = ds.map(_parser)                           # map them based on tfrecord format
    ds = ds.batch(BATCH_SIZE, drop_remainder=True) # group elements in batch (remove batch of less than BATCH_SIZE)
    ds = ds.repeat()                               # iterate infinitely 

    return ds.make_initializable_iterator()        # initialize the iterator


def iterator_to_data(iterator):
  """Creates a part of the graph which reads the raw data from an iterator and transforms it to a 
  data ready to be passed to model.

  Args:
    iterator      - iterator. Created by `init_tfrecord_dataset`

  Returns:
    data_board      - (BATCH_SIZE, 8, 8, 24) you can think of it as NHWC for images.
    data_flags      - (BATCH_SIZE, 10)
    combined_score  - (BATCH_SIZE,)
  """

  b = tf.constant((128, 64, 32, 16, 8, 4, 2, 1), dtype=tf.uint8, name='unpacked_const')

  with tf.name_scope('tfr_parse'):
    with tf.name_scope('packed_data'):
      next_element = iterator.get_next()
      data_packed, score_int = next_element
      score = tf.cast(score_int, tf.float64, name='score_float')

    # https://stackoverflow.com/q/45454470/1090562
    with tf.name_scope('data_unpacked'):
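      # Unpack bits: integer-divide each of the 194 packed bytes by descending powers
      # of two and take mod 2, yielding 194 * 8 = 1552 binary features per example.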
      data_unpacked = tf.reshape(tf.mod(tf.to_int32(tf.decode_raw(data_packed, tf.uint8)[:,:,None] // b), 2), [BATCH_SIZE, 1552], name='data_unpack')

    with tf.name_scope('score'):
      with tf.name_scope('is_mate'):
        score_is_mate = tf.cast(tf.squeeze(tf.slice(data_unpacked, [0, 1546], [BATCH_SIZE, 1])), tf.float64, name='is_mate')
      with tf.name_scope('combined'):
        combined_score = (1 - score_is_mate) * VALUE_A * tf.tanh(score / VALUE_K) + score_is_mate * tf.sign(score) * (VALUE_A + (1 - VALUE_A) / (VALUE_B - 1) * tf.reduce_max(tf.stack([tf.zeros(BATCH_SIZE, dtype=tf.float64), VALUE_B - tf.abs(score)]), axis=0))


    with tf.name_scope('board'):
      with tf.name_scope('reshape_layers'):
        data_board = tf.reshape(tf.slice(data_unpacked, [0, 0], [BATCH_SIZE, 8 * 8 * 24]), [BATCH_SIZE, 8, 8, 24], name='board_reshape')

      with tf.name_scope('combine_layers'):  
        data_board = tf.cast(tf.stack([
          data_board[:,:,:, 0],
          data_board[:,:,:, 4],
          data_board[:,:,:, 8],
          data_board[:,:,:,12],
          data_board[:,:,:,16],
          data_board[:,:,:,20],
          - data_board[:,:,:, 1],
          - data_board[:,:,:, 5],
          - data_board[:,:,:, 9],
          - data_board[:,:,:,13],
          - data_board[:,:,:,17],
          - data_board[:,:,:,21],
          data_board[:,:,:, 2],
          data_board[:,:,:, 6],
          data_board[:,:,:,10],
          data_board[:,:,:,14],
          data_board[:,:,:,18],
          data_board[:,:,:,22],
          - data_board[:,:,:, 3],
          - data_board[:,:,:, 7],
          - data_board[:,:,:,11],
          - data_board[:,:,:,15],
          - data_board[:,:,:,19],
          - data_board[:,:,:,23],
        ], axis=3), tf.float64, name='board_compact')

    with tf.name_scope('flags'):
      data_flags = tf.cast(tf.slice(data_unpacked, [0, 1536], [BATCH_SIZE, 10]), tf.float64, name='flags')

  return data_board, data_flags, combined_score
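
For reference, here is a minimal sketch (not from the question; build_model is a hypothetical placeholder for the actual network) of how these two functions are presumably wired into a TF1 training loop:

iterator = init_tfrecord_dataset()
data_board, data_flags, combined_score = iterator_to_data(iterator)
loss = build_model(data_board, data_flags, combined_score)  # hypothetical model fn
train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
  sess.run(iterator.initializer)               # initialize the dataset iterator
  sess.run(tf.global_variables_initializer())
  for step in range(1000):
    _, loss_val = sess.run([train_op, loss])   # one batch of 512 per iteration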

I am looking for practical solutions (I have already tried a significant number of theoretical ideas) which can improve the speed of training (in terms of examples/second). I am not looking for a way to improve the accuracy of the model (or to modify the model), as this is just a test model.

I have spent a significant amount of time trying to optimize this (and gave up). So I would be happy to award a bounty of 200 for a working solution with a nice explanation.

The suggestion from hampi to profile your training job is a good one, and may be necessary to understand the actual bottlenecks in your pipeline. The other suggestions in the Input Pipeline performance guide should be useful as well.

However, there is another possible "quick fix" that might be useful. In some cases, the amount of work in a Dataset.map() transformation can be very small, and dominated by the cost of invoking the function for each element. In those cases, we often try to vectorize the map function and move it after the Dataset.batch() transformation, in order to invoke the function fewer times (1/512 as many times, in this case) and to perform larger (and potentially easier-to-parallelize) operations on each batch. Fortunately, your pipeline can be vectorized as follows:

def _batch_parser(record_batch):
  # NOTE: Use `tf.parse_example()` to operate on batches of records.
  parsed = tf.parse_example(record_batch, _keys_to_map)
  return parsed['d'], parsed['s']

def init_tfrecord_dataset():
  files_train = glob.glob(DIR_TFRECORDS + '*.tfrecord')
  random.shuffle(files_train)

  with tf.name_scope('tfr_iterator'):
    ds = tf.data.TFRecordDataset(files_train)      # define data from randomly ordered files
    ds = ds.shuffle(buffer_size=10000)             # select elements randomly from the buffer
    # NOTE: Change begins here.
    ds = ds.batch(BATCH_SIZE, drop_remainder=True) # group elements in batch (remove batch of less than BATCH_SIZE)
    ds = ds.map(_batch_parser)                     # map batches based on tfrecord format
    # NOTE: Change ends here.
    ds = ds.repeat()                               # iterate infinitely 

    return ds.make_initializable_iterator()        # initialize the iterator

Currently, vectorization is a change that you have to make manually, but the tf.data team is working on an optimization pass that provides automatic vectorization.

I have a couple of suggestions:

1) After creating the batch, the entire batch is processed by the iterator_to_data() function. This isn't really distributing the task across multiple threads, at least not at the API level. Instead, you could try something like this in the init_tfrecord_dataset() function:

ds = tf.data.TFRecordDataset(files_train)      # define data from randomly ordered files
ds = ds.shuffle(buffer_size=10000)             # select elements randomly from the buffer
ds = ds.map(_parser)  
ds = ds.map(map_func=iterator_to_data, num_parallel_calls=FLAGS.num_preprocessing_threads)
ds = ds.batch(BATCH_SIZE, drop_remainder=True) # group elements in batch (remove batch of less than BATCH_SIZE)
ds = ds.repeat()

You might also want to change a few lines in the iterator_to_data() function, since with the above changes its input argument is no longer an iterator; a rough sketch of that change follows.
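
As a rough, untested sketch of that change (record_to_data is a hypothetical name, and for brevity it omits the score combination and the channel sign-flipping done in iterator_to_data()), the per-element map function could look like:

def record_to_data(data_packed, score_int):
  # Per-element variant of iterator_to_data(): receives the parsed tensors
  # directly, so there is no leading BATCH_SIZE dimension yet.
  b = tf.constant((128, 64, 32, 16, 8, 4, 2, 1), dtype=tf.uint8)
  score = tf.cast(score_int, tf.float64)
  bits = tf.reshape(
      tf.mod(tf.to_int32(tf.decode_raw(data_packed, tf.uint8)[:, None] // b), 2),
      [1552])
  data_board = tf.cast(tf.reshape(bits[:8 * 8 * 24], [8, 8, 24]), tf.float64)
  data_flags = tf.cast(bits[1536:1546], tf.float64)
  return data_board, data_flags, score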

2) You might also want to get profiling information using something like tf.train.ProfilerHook. This can tell you whether the bottleneck is on the CPU or the GPU. For example, if the bottleneck is on the CPU, you may see GPU ops waiting for a memcpyHtoD op to complete.
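
A minimal sketch of attaching such a hook (this assumes the training loop is rewritten around tf.train.MonitoredTrainingSession; train_op is the training step defined elsewhere):

profiler_hook = tf.train.ProfilerHook(save_steps=100, output_dir='/tmp/profiler')
with tf.train.MonitoredTrainingSession(hooks=[profiler_hook]) as sess:
  while not sess.should_stop():
    sess.run(train_op)  # writes a timeline trace every 100 steps

The saved timeline files can be opened in chrome://tracing to see which ops dominate each step.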
