
How to speed up tf.data.Dataset.from_generator()

In TensorFlow 2.0, I want to train a skip-gram model with NCE loss. tf.data.Dataset.from_tensor_slices() is not suitable because the input file is really huge. So I wrote a dataset generator class like this:

import tqdm
import tensorflow as tf


class DataSet:
    """Streams skip-gram or CBOW samples from a text file."""

    def __init__(self, args, vocab):
        self.args = args
        self.vocab = vocab

    def generator(self):
        """A generator function; it yields skip-gram samples or CBOW samples."""
        with open(self.args.input) as f_input:
            for line in tqdm.tqdm(f_input.readlines()):
                tokens = line.strip().split()
                tokens_indices = self.vocab.indices(tokens)
                for index, target_word in enumerate(tokens_indices):
                    context_words = list()
                    begin = max(index - self.args.window_size, 0)
                    end = min(index + 1 + self.args.window_size, len(tokens_indices))
                    context_words.extend(tokens_indices[begin:index])
                    context_words.extend(tokens_indices[index + 1:end])
                    if self.args.cbow > 0:
                        yield context_words, target_word
                    else:
                        for context_word in context_words:
                            yield target_word, context_word

    def dataset(self):
        """Using tf.data.Dataset.from_generator() to return sample"""
        if self.args.cbow:
            dataset = tf.data.Dataset.from_generator(
                self.generator,
                (tf.int32, tf.int32),
                (tf.TensorShape([None]), tf.TensorShape([]))
            )
        else:
            dataset = tf.data.Dataset.from_generator(
                self.generator,
                (tf.int32, tf.int32),
                (tf.TensorShape([]), tf.TensorShape([]))
            )

        return dataset

Then I test my code as follows:

dataset = DataSet(args, vocab).dataset()
# In TF 2.x a tf.data.Dataset is directly iterable; make_one_shot_iterator() no longer exists
for batch, (x, y) in enumerate(dataset.batch(128)):
    pass
print(batch, x.shape, y.shape)

But it takes a lot of time to iterate over all the lines (about 10 minutes for 15,000 lines on a 2012 MacBook Pro). Are there any methods that can speed up the code?

If you are working with large datasets then TFRecord is a suitable option. It uses a binary file format for storing your data and can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to their much lower read/write performance in comparison with SSDs.

However, pure performance isn't the only advantage of the TFRecord file format. It is optimized for use with TensorFlow in multiple ways. To start with, it makes it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library. Especially for datasets that are too large to be stored fully in memory this is an advantage, as only the data that is required at the time (e.g. a batch) is loaded from disk and then processed. Another major advantage of TFRecords is that it is possible to store sequence data, for instance a time series or word encodings, in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data.
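
To make the batch-at-a-time loading concrete, here is a minimal sketch (not part of the original answer) of a batched, prefetched TFRecord input pipeline; the file name "data.tf_record" and the single float feature are illustrative placeholders:

import tensorflow as tf

# Sketch: read records lazily, decode in parallel, and overlap I/O with training.
def parse(serialized_example):
    features = {'value': tf.io.FixedLenFeature([], tf.float32)}
    return tf.io.parse_single_example(serialized_example, features)['value']

dataset = (
    tf.data.TFRecordDataset("data.tf_record")  # streams from disk, not memory
    .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(128)                                # only one batch is materialized at a time
    .prefetch(tf.data.experimental.AUTOTUNE)   # prepare the next batch during the training step
)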

I would recommend going through this official link for a glimpse of TFRecord. You can also go through this link on how to build a TFRecord pipeline.

Here is a simple example of writing a serialized record using TFRecordWriter and then loading it in a TFRecordDataset:

%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

def write_date_tfrecord():
    # Write 10 dummy values to replicate the issue
    Output = [20191221 + x for x in range(0, 10)]
    print("Writing Output - ", Output)

    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                'Output': tf.train.Feature(float_list=tf.train.FloatList(value=Output))
            }
        ))

    # Use a context manager so the record file is flushed and closed
    with tf.io.TFRecordWriter("Output.tf_record") as writer:
        writer.write(example.SerializeToString())

def parse_function(serialized_example):
    features = {
        'Output': tf.io.FixedLenSequenceFeature([], tf.float32, allow_missing=True)
    }
    features = tf.io.parse_single_example(serialized=serialized_example, features=features)
    return features['Output']

def dataset_generator():
    tf_record_dataset = tf.data.TFRecordDataset("Output.tf_record")
    tf_record_dataset = tf_record_dataset.map(
        parse_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return tf_record_dataset

if __name__ == '__main__':
    write_date_tfrecord()
    generator = dataset_generator()
    for Output in generator:
        print(Output)

Output:

2.2.0
Writing Output -  [20191221, 20191222, 20191223, 20191224, 20191225, 20191226, 20191227, 20191228, 20191229, 20191230]
tf.Tensor(
[20191220. 20191222. 20191224. 20191224. 20191224. 20191226. 20191228.
 20191228. 20191228. 20191230.], shape=(10,), dtype=float32)
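
Note that the printed values differ slightly from the ones that were written: these integers exceed float32's 24-bit mantissa, so storing them in a FloatList rounds them. The skip-gram pairs in the question are word indices, so int64 features avoid this entirely. Below is a sketch (not part of the original answer) of adapting the same pattern to (target, context) index pairs; the function names and file name are illustrative placeholders:

import tensorflow as tf

# Sketch: write integer (target, context) pairs once, then stream them in batches.
def write_pairs_tfrecord(pairs, path="skipgram.tf_record"):
    with tf.io.TFRecordWriter(path) as writer:
        for target, context in pairs:
            example = tf.train.Example(features=tf.train.Features(feature={
                'target': tf.train.Feature(int64_list=tf.train.Int64List(value=[target])),
                'context': tf.train.Feature(int64_list=tf.train.Int64List(value=[context])),
            }))
            writer.write(example.SerializeToString())

def parse_pair(serialized_example):
    features = {
        'target': tf.io.FixedLenFeature([], tf.int64),
        'context': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized_example, features)
    return parsed['target'], parsed['context']

# Example usage with a few dummy pairs:
write_pairs_tfrecord([(1, 2), (1, 3), (4, 5)])
dataset = (
    tf.data.TFRecordDataset("skipgram.tf_record")
    .map(parse_pair, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(128)
    .prefetch(tf.data.experimental.AUTOTUNE)
)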

Hope this answers your question. Happy Learning.
