
TensorFlow - Interleave multiple independently preprocessed TFRecord files

I have multiple TFRecord files from the Waymo Dataset, each containing consecutive data points; points are not consecutive across files. I'm building an input pipeline that preprocesses the data for time-series prediction via the window() API, but I need to prevent a window from spanning multiple files.

To do so, I believe I should preprocess each file independently and interleave the resulting datasets. Here's my attempt:
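The constraint can be illustrated in pure Python, without TensorFlow (the file names and record values below are made up for illustration): taking sliding windows within each file and only then combining the per-file results guarantees that no window mixes records from two files.

```python
def sliding_windows(records, size, shift=1):
    """All length-`size` windows over one file's records (drop remainder)."""
    return [records[i:i + size] for i in range(0, len(records) - size + 1, shift)]

files = {                       # hypothetical per-file record sequences
    "a.tfrecord": [1, 2, 3, 4],
    "b.tfrecord": [10, 11, 12],
}

windows = []
for recs in files.values():     # window each file independently...
    windows.extend(sliding_windows(recs, size=3))
# ...so a cross-file window like [4, 10, 11] can never appear.
```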

import os
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset  # for parsing Waymo frames

filenames = [os.path.join(DATASET_DIR, f) for f in os.listdir(DATASET_DIR)]
dataset = tf.data.TFRecordDataset(filenames, compression_type='')

def interleave_fn(filename):
    ds = filename.map(lambda x: tf.py_function(_parse_data, [x], [tf.float32]*N_FEATURES,), 
                          num_parallel_calls=tf.data.experimental.AUTOTUNE) 
    ds = ds.map(_concatenate_tensors).map(_set_x_shape)
    ds = build_x_dataset(ds)
    return ds

def _parse_data(data):
    # Parse feature from Waymo dataset  
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))   
    av_v_x = frame.images[0].velocity.v_x 
    av_v_y = frame.images[0].velocity.v_y 
    return av_v_x, av_v_y

def _concatenate_tensors(*x):
    #Concatenate tensor tuple in a single tensor
    return tf.stack((x))

def _set_x_shape(x):
    #Set X dataset shape. If not UNDEFINED RANK ValueError
    x.set_shape((N_FEATURES,))
    return x
    
def build_x_dataset(ds_x, window = WINDOW):
    # Extract sequences for time series prediction training
    # Selects a sliding window of WINDOW samples, shifting by 1 sample at a time
    ds_x = ds_x.window(size = window, shift = 1, drop_remainder = True)
    
    # Each element of `ds_x` is a nested dataset containing WINDOW consecutive examples
    ds_x = ds_x.map(lambda d: tf.data.experimental.get_single_element(d.batch(window))) 
    return ds_x

dataset = dataset.interleave(interleave_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)

This returns

AttributeError: in user code:

    /tmp/xpython_26752/494049692.py:118 interleave_fn  *
        ds = filename.map(lambda x: tf.py_function(_parse_data, [x], [tf.float32]*N_FEATURES,),

    AttributeError: 'Tensor' object has no attribute 'map'

which makes sense, because print(filename) in interleave_fn gives

Tensor("args_0:0", shape=(), dtype=string)

I thought interleave_fn would be applied to each TFRecordDataset, so filename would be a dataset itself rather than a tensor. What's wrong here? Thank you!

Solved it by looping over all TFRecord files and appending the corresponding datasets to a dataset list, then following this tip to interleave all the preprocessed datasets.
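An equivalent fix, sketched below with synthetic data instead of Waymo frames: the function passed to interleave receives a scalar filename tensor (that is why the original code failed with AttributeError), so the TFRecordDataset can be built inside that function, making each file windowed independently. The WINDOW value, file contents, and omitted Waymo parsing are placeholders.

```python
import os
import tempfile
import tensorflow as tf

WINDOW = 3  # placeholder window size

def make_windows(filename):
    # Build the TFRecordDataset *inside* the function passed to interleave:
    # `filename` is a scalar string tensor, not a dataset.
    ds = tf.data.TFRecordDataset(filename)
    # (the per-file Waymo parsing/preprocessing would go here)
    ds = ds.window(size=WINDOW, shift=1, drop_remainder=True)
    # Each window is a nested dataset; flatten it into one batched tensor.
    ds = ds.flat_map(lambda w: w.batch(WINDOW))
    return ds

# Synthetic data: two files whose records are tagged with their file index.
tmp = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmp, f"file{i}.tfrecord")
    with tf.io.TFRecordWriter(p) as writer:
        for j in range(5):
            writer.write(f"f{i}-r{j}".encode())
    paths.append(p)

dataset = tf.data.Dataset.from_tensor_slices(paths).interleave(
    make_windows, cycle_length=2,
    num_parallel_calls=tf.data.AUTOTUNE)

windows = [[rec.decode() for rec in w.numpy()] for w in dataset]
```

Because the windowing happens per file before interleaving, every window's records share the same file prefix and no window spans two files.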

