Aligning batched sliding-frame time series data for tensorflow/keras using timeseries_dataset_from_array and TimeseriesGenerator respectively

Question

I have multiple input features and a single target feature that correspond 1:1 by index, meaning there should be no forward-looking or backward-looking when comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.

Under normal operating procedures, I would use N periods' worth of past data to predict 1 future value, N periods ahead. As the frame shifts forward in time, each respective slot is filled with the [t+N] forecast, recorded at [t].
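A minimal sketch of the target shift described above, assuming a single 1-D series and a hypothetical horizon parameter `n_ahead` (neither the function nor the parameter name comes from the question's code). After the shift, `targets[t]` holds the value observed `n_ahead` steps after `inputs[t]`, so both arrays can be indexed by the same `t`.

```python
import numpy as np

def shift_targets_back(values, n_ahead):
    """Pair each input index t with the value observed n_ahead steps later.

    Hypothetical helper for illustration; n_ahead must be at least 1.
    """
    if n_ahead < 1:
        raise ValueError("n_ahead must be at least 1")
    values = np.asarray(values)
    inputs = values[:-n_ahead]   # rows t = 0 .. len(values) - n_ahead - 1
    targets = values[n_ahead:]   # targets[t] == values[t + n_ahead]
    return inputs, targets

series = np.arange(10)
inputs, targets = shift_targets_back(series, n_ahead=3)
print(inputs)   # [0 1 2 3 4 5 6]
print(targets)  # [3 4 5 6 7 8 9]
```

The last `n_ahead` rows have no known future value yet, which is why both arrays come out shorter than the original series.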

Now, depending on the environment I'm developing in, I will need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (based on system support). I need to know whether my implementation produces batches that will do what I expect when running model.fit() in keras. I'm unsure whether keras internally shifts data during fitting in some way I'm unaware of that might lead to poor results.

I'm using an LSTM, potentially with the stateful argument, so I need to ensure my batches are a perfect fit, and I also wanted to ensure the batch sizes are a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function to make this happen, given a few additional assumptions regarding validation/test sizes. On the surface everything looks good, but since I'm unsure of keras' internals I don't know if I've made a blunder.
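The two batching constraints mentioned here can be sketched in isolation. The helper names below are hypothetical and this is not the code from the question; it only shows one way to pick the largest power of two and to drop the oldest samples so that every batch is full, which is what a stateful LSTM with a fixed batch size needs.

```python
def largest_power_of_two(n):
    """Largest power of two that is <= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (int(n).bit_length() - 1)

def trim_to_full_batches(n_samples, batch_size):
    """Return (start, n_kept) so that n_kept is a multiple of batch_size.

    The oldest `start` samples are the ones dropped, keeping the most
    recent data intact.
    """
    n_kept = n_samples - n_samples % batch_size
    return n_samples - n_kept, n_kept

batch_size = largest_power_of_two(25)
start, n_kept = trim_to_full_batches(101, batch_size)
print(batch_size, start, n_kept)  # 16 5 96
```

Using integer bit operations avoids the floating-point rounding that can creep into a `log`-based calculation.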

My question is whether I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array / TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t] using inputs at time [t].
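For reference, the two utilities pair windows with targets differently. The sketch below is a plain NumPy emulation of their documented indexing with stride 1 and sampling rate 1. It does not call TensorFlow, and the function names are hypothetical. With unadjusted data, TimeseriesGenerator pairs a window with the target one step after the window, while timeseries_dataset_from_array pairs a window with the target at the window's first index.

```python
import numpy as np

def windows_like_timeseries_generator(data, targets, length):
    """Emulates TimeseriesGenerator: sample i is data[i-length:i], target is targets[i]."""
    idx = range(length, len(data))
    x = np.stack([data[i - length:i] for i in idx])
    y = np.array([targets[i] for i in idx])
    return x, y

def windows_like_dataset_from_array(data, targets, sequence_length):
    """Emulates timeseries_dataset_from_array: sample i is data[i:i+sequence_length], target is targets[i]."""
    n = min(len(data) - sequence_length + 1, len(targets))
    x = np.stack([data[i:i + sequence_length] for i in range(n)])
    y = np.array([targets[i] for i in range(n)])
    return x, y

t = np.arange(20)
length = 4

# Unadjusted pairing.
gx, gy = windows_like_timeseries_generator(t, t, length)
dx, dy = windows_like_dataset_from_array(t, t, length)
print(gx[0], gy[0])  # [0 1 2 3] 4
print(dx[0], dy[0])  # [0 1 2 3] 0

# Adjusted so the last step of each window shares the target's index.
gx2, gy2 = windows_like_timeseries_generator(t[1:], t[:-1], length)  # inputs shifted by one
dx2, dy2 = windows_like_dataset_from_array(t, t[length - 1:], length)  # targets offset by length - 1
assert (gx2[:, -1] == gy2).all()
assert (dx2[:, -1] == dy2).all()
```

The two adjustments correspond to the approach in the code that follows: shifting the features by one step for the generator, and starting the inputs `memory - 1` rows before the targets for the dataset utility.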

import pandas as pd
import numpy as np

use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen

    use_ts_data = False

def gp2(size):
    return np.power(2, int(np.log2((size))))

def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower
            batch_size = batch_size // 2  # integer halving avoids float rounding in the log ratio
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find greatest batch size that is a power of 2, that fits the smallest dataset size, and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None
        
        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]

        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]

        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)
        
        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]

        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]

        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]
    
    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the batched data for inspection
    def results(dataset, index):
        print("\n-------------------\n")
        print(f"Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, batch in enumerate(dataset):
            inputs, targets = batch
            if i == 0:
                print(
                    f"First:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
            if i == last_i:
                print(
                    f"Last:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )

# inputs and targets are expected to be aligned (i.e., loss functions should subtract the predicted target@t from the actual target@t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x

batch_size, train, validation, test, train_index, validation_index, test_index = train_validate_test_split(df['inputs'], df['targets'], train_size_ratio=0.5, max_batch_size=2, memory=8)
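Because the toy frame above uses the same series for inputs and targets, alignment can be checked mechanically rather than by reading the printed batches: the last time step of every window should equal its target. The helper below is a hypothetical sketch, not part of the question's code, and is demonstrated on hand-built batches so it runs without TensorFlow.

```python
import numpy as np

def check_last_step_alignment(batches):
    """Return True if the last time step of every window equals its target.

    Only meaningful for toy data where inputs and targets are the same series.
    """
    for inputs, targets in batches:
        inputs = np.asarray(inputs)
        targets = np.asarray(targets).reshape(len(inputs), -1)
        last_step = inputs[:, -1].reshape(len(inputs), -1)
        if not np.array_equal(last_step, targets):
            return False
    return True

good = [(np.array([[0, 1, 2], [1, 2, 3]]), np.array([2, 3]))]
bad = [(np.array([[0, 1, 2], [1, 2, 3]]), np.array([3, 4]))]
print(check_last_step_alignment(good))  # True
print(check_last_step_alignment(bad))   # False
```

Assuming the batches iterate as `(inputs, targets)` pairs the way the `results` function above consumes them, the same check could be applied to `train`, `validation`, and `test`.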

Answer 1

All loss/metric functions that take y_pred and y_true assume matching indices. Keras does nothing special in the background.
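To illustrate, here is a plain NumPy stand-in for a mean-squared-error loss (an illustration only, not Keras's own implementation). The comparison is element by element, so predictions that carry the right values at the wrong index are penalized like any other error.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, comparing y_true[i] with y_pred[i] for each i."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y_true.copy()
off_by_one = np.array([2.0, 3.0, 4.0, 5.0])  # right values, wrong index

print(mse(y_true, perfect))     # 0.0
print(mse(y_true, off_by_one))  # 1.0
```

This is why any alignment has to be done when building the batches: nothing downstream will re-index the targets.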

Solution 1
0 · accepted · 2022-07-06 02:57:21
