
How to use parallel_interleave in TensorFlow

I am reading the code in the TensorFlow benchmarks repo. The following piece of code is the part that creates a TensorFlow dataset from TFRecord files:

ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(tf.data.TFRecordDataset, cycle_length=10))

I am trying to change this code to create a dataset directly from JPEG image files:

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(?, cycle_length=10))

I don't know what to write in the ? place. For TFRecord files, the map_func in parallel_interleave() is the __init__() of the tf.data.TFRecordDataset class, but I don't know what to write for JPEG files.

We don't need to do any transformations here, because we will zip two datasets and then do the transformations later. The code is as follows:

counter = tf.data.Dataset.range(batch_size)
ds = tf.data.Dataset.zip((ds, counter))
ds = ds.apply(
    batching.map_and_batch(
        map_func=preprocess_fn,
        batch_size=batch_size,
        num_parallel_batches=num_splits))

Because we don't need a transformation in the ? place, I tried to use an empty map_func, but I get the error "map_func must return a `Dataset` object". I also tried to use tf.data.Dataset itself, but the output says Dataset is an abstract class and is not allowed to be put there.
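To illustrate what I mean: as far as I understand, the map_func has to return a Dataset for every input element, so a hypothetical wrapper like the sketch below would pass that check, but it obviously does nothing useful:

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(
    lambda f: tf.data.Dataset.from_tensors(f),  # dummy map_func, just re-wraps the element
    cycle_length=10))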

Can anyone help with this? Thanks very much.

parallel_interleave is useful when you have a transformation that turns each element of a source dataset into multiple elements in the destination dataset. I'm not sure why they use it like that in the benchmarks repo, when they could have just used a map with parallel calls.
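For a plain list of JPEG files, where each input element is already exactly one example, that simpler pattern is enough. A minimal sketch (my own, assuming jpeg_file_names is a list of paths and the TF 1.x API used in the question):

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
# Each JPEG file yields exactly one image, so a parallel map is sufficient;
# there is no file-to-many-records expansion as with TFRecord shards.
ds = ds.map(
    lambda path: tf.image.decode_jpeg(tf.read_file(path), channels=3),
    num_parallel_calls=10)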

Here's how I suggest using parallel_interleave for reading images from several directories, each containing one class:

from glob import glob      # needed for glob() below
import numpy as np
import tensorflow as tf
DS = tf.data.Dataset       # shorthand used throughout this answer

classes = sorted(glob(directory + '/*/'))  # final slash selects directories only
num_classes = len(classes)

labels = np.arange(num_classes, dtype=np.int32)

dirs = DS.from_tensor_slices((classes, labels))                # 1
files = dirs.apply(tf.contrib.data.parallel_interleave(
    get_files, cycle_length=num_classes, block_length=4,       # 2
    sloppy=False))  # False is important! Otherwise it mixes labels
files = files.cache()
imgs = (files.map(read_decode, num_parallel_calls=20)          # 3
             .apply(tf.contrib.data.shuffle_and_repeat(100))
             .batch(batch_size)
             .prefetch(5))
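From there, the dataset can be consumed like any other tf.data pipeline; a minimal sketch, assuming the TF 1.x iterator style used in the rest of this answer:

it = imgs.make_one_shot_iterator()
images, batch_labels = it.get_next()  # batched tensors to feed into the model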

There are three steps. First, we get the list of directories and their labels (#1).

Then, we map these to a dataset of files. But if we did a simple .flat_map(), we would end up with all the files of label 0, followed by all the files of label 1, then 2, etc. Then we'd need really large shuffle buffers to get a meaningful shuffle.
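For contrast, that flat_map variant would look like the hypothetical sketch below (this is what we want to avoid, not part of the pipeline):

# Emits every file of class 0, then every file of class 1, and so on,
# so a huge shuffle buffer would be needed afterwards.
files_sequential = dirs.flat_map(get_files)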

So, instead, we apply parallel_interleave (#2). Here is get_files():

def get_files(dir_path, label):
    globbed = tf.string_join([dir_path, '*.jpg'])
    files = tf.matching_files(globbed)

    num_files = tf.shape(files)[0] # in the directory
    labels = tf.tile([label], [num_files, ]) # expand label to all files
    return DS.from_tensor_slices((files, labels))

Using parallel_interleave ensures that the file listing (tf.matching_files) of each directory runs in parallel, so by the time the first block_length files are listed from the first directory, the first block_length files from the 2nd directory will also be available (and likewise from the 3rd, 4th, etc.). Moreover, the resulting dataset will contain interleaved blocks of each label, e.g. 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 ... (for 3 classes and block_length=4).
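To see the block interleaving in isolation, here is a toy sketch (not part of the pipeline) that fakes three per-class datasets with repeated integers:

# Three "classes" of 8 elements each, interleaved in blocks of 4:
# 0 0 0 0 1 1 1 1 2 2 2 2 0 0 0 0 1 1 1 1 2 2 2 2
toy = tf.data.Dataset.range(3).apply(tf.contrib.data.parallel_interleave(
    lambda lbl: tf.data.Dataset.from_tensors(lbl).repeat(8),
    cycle_length=3, block_length=4, sloppy=False))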

Finally, we read the images from the list of files (#3). Here is read_decode():

def read_decode(path, label):
    # target_size, num_classes and preprocess_fct are assumed to be defined
    # elsewhere (output image size, number of classes, preprocessing function)
    img = tf.image.decode_image(tf.read_file(path), channels=3)
    img = tf.image.resize_bilinear(tf.expand_dims(img, axis=0), target_size)
    img = tf.squeeze(img, 0)
    img = preprocess_fct(img)  # should work with Tensors!

    label = tf.one_hot(label, num_classes)
    img = tf.Print(img, [path, label], 'Read_decode')
    return (img, label)

This function takes an image path and its label and returns a tensor for each: an image tensor for the path, and a one_hot encoding for the label. This is also the place where you can do all the transformations on the image. Here, I do resizing and basic pre-processing.
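As an illustration only, preprocess_fct could be something as simple as the placeholder below; the only requirement is that it operates on Tensors:

def preprocess_fct(img):
    # placeholder preprocessing: scale pixel values to [0, 1]
    return tf.cast(img, tf.float32) / 255.0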
