Using Tensorflow Interleave to Improve Performance

I have an input pipeline that performs poorly, with low CPU, GPU, and disk utilization. I've been reading the TensorFlow "Better performance with tf.data API" guide and the Dataset docs, but I don't understand what's going on well enough to apply it to my situation. Here's my current setup:

img_files = sorted(tf.io.gfile.glob(...))
imgd = tf.data.FixedLengthRecordDataset(img_files, inrez*inrez)
#POINT1A
imgd = imgd.map(lambda s: tf.reshape(tf.io.decode_raw(s, tf.int8), (inrez,inrez,1)))
imgd = imgd.map(lambda x: tf.cast(x, dtype=tf.float32))

out_files = sorted(tf.io.gfile.glob(...))
outd = tf.data.FixedLengthRecordDataset(out_files, 4, compression_type="GZIP")
#POINT1B
outd = outd.map(lambda s: tf.io.decode_raw(s, tf.float32))

xsrc = tf.data.Dataset.zip((imgd, outd)).batch(batchsize)
xsrc = xsrc.repeat()        # indefinitely
#POINT2
xsrc = xsrc.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Should I interleave the whole pipeline right at the end (POINT2), before the prefetch? Or interleave imgd and outd separately, after each FixedLengthRecordDataset (POINT1A, POINT1B), and parallelize the maps? (I need to keep imgd and outd synced up!) What's up with Dataset.range(rvalue)? It seems to be necessary, but it's not obvious what rvalue to use. Is there a better overall plan?

Note that the datasets are very large and do not fit in RAM.

Interleave lets you process each file in a separate logical thread (in parallel), then combine the data from the files into a single dataset. Since your data comes from two corresponding files, you need to be careful to preserve the order.
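To make the combining step concrete, here is a minimal sketch using small in-memory stand-ins for files. The `read_source` helper and the tagged values are hypothetical, chosen only so that the origin of each element is visible. In documentation examples of this shape, `Dataset.range(n)` typically just supplies the input elements that the interleave function turns into per-file datasets; with real data, a dataset of file names plays that role.

```python
import tensorflow as tf

# Three hypothetical "files", identified by the integers 0, 1, 2.
sources = tf.data.Dataset.range(3)

def read_source(i):
    # Stand-in for reading one file: source i yields 10*i, 10*i + 1, 10*i + 2.
    return tf.data.Dataset.from_tensor_slices(i * 10 + tf.range(3, dtype=tf.int64))

# cycle_length=3 keeps all three sources open at once; block_length=1 takes
# one element from each source in turn.
interleaved = sources.interleave(read_source, cycle_length=3, block_length=1)

result = [int(x) for x in interleaved]
print(result)  # [0, 10, 20, 1, 11, 21, 2, 12, 22]
```

Within each source the elements keep their original order; only the sources are mixed with each other.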

Here is an example of how you could put the interleave near the end of the dataset:

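import tensorflow as tf

img_files = ...
out_files = ...
# img_files and out_files are assumed here to be datasets of file names
# (e.g. built with tf.data.Dataset.from_tensor_slices), in matching order.
files = tf.data.Dataset.zip((img_files, out_files))

def parse_img_file(img_file):
  # Read the single file passed in, not the whole list of files.
  imgd = tf.data.FixedLengthRecordDataset(img_file, inrez*inrez)
  ...

def parse_out_file(out_file):
  ...

def parse_files_fn(img_file, out_file):
  img_file_dataset = parse_img_file(img_file)
  out_file_dataset = parse_out_file(out_file)
  # Zip within one file pair so each image stays with its output.
  return tf.data.Dataset.zip((img_file_dataset, out_file_dataset))

dataset = files.interleave(parse_files_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.repeat()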

Each thread of the interleave will produce elements from a different pair of (img, out) files, and the elements produced from each pair of files will be interleaved together.
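As a rough check that this approach keeps the two streams in sync, the sketch below writes a few tiny file pairs to a temporary directory and runs them through the same structure. Everything about the data is an assumption made for the demo: `inrez = 2`, three file pairs, four records per file, and uncompressed output files (the question's output files use `compression_type="GZIP"`). Each record carries the same ID in the image file and in the output file, so a mismatched pair would be easy to spot.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

inrez = 2             # hypothetical tiny image side, for the demo only
records_per_file = 4  # hypothetical

# Write three pairs of small files. Record k stores the value k in both the
# image file (inrez*inrez int8 bytes) and the output file (one float32).
tmpdir = tempfile.mkdtemp()
img_files, out_files = [], []
for f in range(3):
    ids = np.arange(f * records_per_file, (f + 1) * records_per_file)
    img_path = os.path.join(tmpdir, f"img_{f}.bin")
    out_path = os.path.join(tmpdir, f"out_{f}.bin")
    np.repeat(ids, inrez * inrez).astype(np.int8).tofile(img_path)
    ids.astype("<f4").tofile(out_path)
    img_files.append(img_path)
    out_files.append(out_path)

def parse_img_file(img_file):
    imgd = tf.data.FixedLengthRecordDataset(img_file, inrez * inrez)
    imgd = imgd.map(lambda s: tf.reshape(tf.io.decode_raw(s, tf.int8), (inrez, inrez, 1)))
    return imgd.map(lambda x: tf.cast(x, dtype=tf.float32))

def parse_out_file(out_file):
    # Uncompressed in this demo; add compression_type="GZIP" for gzipped files.
    outd = tf.data.FixedLengthRecordDataset(out_file, 4)
    return outd.map(lambda s: tf.io.decode_raw(s, tf.float32))

def parse_files_fn(img_file, out_file):
    # Zip inside one file pair, so pairing never crosses file boundaries.
    return tf.data.Dataset.zip((parse_img_file(img_file), parse_out_file(out_file)))

# Dataset of (img_file, out_file) name pairs, in matching order.
files = tf.data.Dataset.from_tensor_slices((img_files, out_files))
dataset = files.interleave(parse_files_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Collect (image ID, output ID) for every element; the two should always agree.
pairs = [(int(img[0, 0, 0]), int(out[0])) for img, out in dataset.as_numpy_iterator()]
print(pairs)
```

The order in which file pairs are mixed depends on the cycle length that AUTOTUNE picks, so the sketch checks only that each image arrives with its own output, not a particular global order. In a real pipeline, `batch`, `repeat`, and `prefetch` would follow the interleave as in the question.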
