
Tensorflow data pipeline: Slow with caching to disk - how to improve evaluation performance?

I've built a data pipeline. Pseudo code is as follows:

  1. dataset ->
  2. dataset = augment(dataset)
  3. dataset = dataset.batch(35).prefetch(1)
  4. dataset = set_from_generator(to_feed_dict(dataset)) # expensive op
  5. dataset = Cache('/tmp', dataset)
  6. dataset = dataset.unbatch()
  7. dataset = dataset.shuffle(64).batch(256).prefetch(1)
  8. to_feed_dict(dataset)

Steps 1 to 5 are required to generate the pretrained model outputs. I cache them, as they do not change across epochs (the pretrained model weights are not updated). Steps 5 to 8 prepare the dataset for training.

Different batch sizes have to be used, as the pretrained model's inputs have a much higher dimensionality than its outputs.
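For reference, here is a minimal tf.data sketch of the pipeline above. The names `inputs`, `augment` and `pretrained_model` are hypothetical stand-ins for the real pieces, and step 4's generator/feed-dict round-trip is replaced by a plain `map` for brevity:

```python
import tensorflow as tf

inputs = tf.random.uniform([1000, 64, 64, 3])        # dummy high-dimensional inputs
pretrained_model = tf.keras.Sequential([             # stand-in for the frozen model
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(16),
])

def augment(x):
    return tf.image.random_flip_left_right(x)        # stand-in for real augmentation

dataset = tf.data.Dataset.from_tensor_slices(inputs)                   # 1.
dataset = dataset.map(augment)                                         # 2.
dataset = dataset.batch(35).prefetch(1)                                # 3. small batches: big inputs
dataset = dataset.map(lambda x: pretrained_model(x, training=False))   # 4. expensive op
dataset = dataset.cache('/tmp/templates')                              # 5. cache outputs to disk
dataset = dataset.unbatch()                                            # 6.
dataset = dataset.shuffle(64).batch(256).prefetch(1)                   # 7. big batches: small outputs
```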

The first epoch is slow, as it has to evaluate the pretrained model on every input item to generate the templates and save them to disk. Later epochs are faster, yet they're still quite slow - I suspect the bottleneck is reading the disk cache.

What could be improved in this data pipeline to reduce the issue? Thank you!

  1. prefetch(1) means that only one element will be prefetched; I think you may want to make the buffer as big as the batch size or larger (see the first sketch after this list).
  2. After the first cache you may try to put it a second time, but without providing a path, so that some elements are also cached in memory (also shown in the first sketch below).
  3. Maybe your HDD is just slow? ;)
  4. Another idea is that you could manually write a compressed TFRecord after steps 1-4 and then read it with another dataset (see the second sketch below). A compressed file has lower I/O but causes higher CPU usage.
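A minimal sketch of suggestions 1 and 2, assuming `dataset` is the dataset produced by step 4; `tf.data.experimental.AUTOTUNE` is a built-in alternative to hand-picking the prefetch buffer size:

```python
import tensorflow as tf

# Suggestion 2: layer an in-memory cache (no path) on top of the disk cache,
# so later epochs read from RAM instead of the HDD once it is populated.
dataset = dataset.cache('/tmp/templates')   # step 5: disk cache
dataset = dataset.cache()                   # no path -> cache in memory

# Suggestion 1: prefetch more than one element. Note that after .batch()
# each prefetched element is a whole batch.
dataset = dataset.unbatch()
dataset = dataset.shuffle(64).batch(256).prefetch(tf.data.experimental.AUTOTUNE)
```

And a sketch of suggestion 4, assuming the step-4 outputs are float32 tensors in a dataset called `templates` (a hypothetical name) and that you run it eagerly (TF 2.x):

```python
import tensorflow as tf

# One-off: serialize every pretrained-model output into a compressed TFRecord.
with tf.io.TFRecordWriter('/tmp/templates.tfrecord', options='GZIP') as writer:
    for batch in templates:                 # templates = dataset after steps 1-4
        for item in batch:
            writer.write(tf.io.serialize_tensor(item).numpy())
```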
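In later runs, the records can be read back instead of re-running the pretrained model; GZIP trades disk I/O for CPU:

```python
restored = tf.data.TFRecordDataset('/tmp/templates.tfrecord',
                                   compression_type='GZIP')
restored = restored.map(lambda s: tf.io.parse_tensor(s, out_type=tf.float32))
restored = restored.shuffle(64).batch(256).prefetch(1)
```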
