TensorFlow data pipeline: slow with caching to disk - how to improve evaluation performance?
I've built a data pipeline. Pseudo code is as follows:
Steps 1 to 5 generate the pretrained model's outputs. I cache these, as they do not change across epochs (the pretrained model's weights are not updated). Steps 5 to 8 prepare the dataset for training.
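Since the original pseudo code did not survive, here is a minimal sketch of what such a pipeline might look like; the function name, step boundaries, and buffer sizes are all assumptions, not the asker's actual code:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def build_pipeline(inputs, pretrained_model, cache_path, infer_batch, train_batch):
    ds = tf.data.Dataset.from_tensor_slices(inputs)
    # Steps 1-5 (assumed): run the frozen pretrained model once and cache its
    # outputs, since the weights never change across epochs.
    ds = (ds.batch(infer_batch)                       # small batch for the big inputs
            .map(pretrained_model, num_parallel_calls=AUTOTUNE)
            .unbatch()
            .cache(cache_path))                       # "" caches in memory, a filename caches on disk
    # Steps 5-8 (assumed): prepare the cached outputs for training,
    # with a different (larger) batch size for the smaller outputs.
    ds = (ds.shuffle(1024)
            .batch(train_batch)
            .prefetch(AUTOTUNE))
    return ds
```

Everything before `.cache()` runs only on the first epoch; everything after it runs every epoch, which is why the post-cache stages matter for steady-state speed.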
Different batch sizes have to be used, as the pretrained model's inputs have a much larger dimensionality than its outputs.
The first epoch is slow, as the pretrained model has to be evaluated on every input item to generate the templates and save them to disk. Later epochs are faster, yet still quite slow - I suspect the bottleneck is reading from the disk cache.
What could be improved in this data pipeline to reduce the issue? Thank you!
prefetch(1) means that only one element will be prefetched; I think you may want the buffer to be as big as the batch size or larger. Note that the prefetch buffer counts elements at that point in the pipeline, so if prefetch comes after batch(), each prefetched element is already a full batch.