Aligning batched sliding-frame time series data for TensorFlow/Keras using timeseries_dataset_from_array and TimeseriesGenerator
I have multiple input features and a single target feature whose indexes correspond 1:1, meaning there should be no looking forward or backward when comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.
Under normal operating procedures, I would use N periods' worth of past data to predict 1 future value, N periods ahead. As the frame shifts forward in time, each respective slot is filled with the [t+N] forecast, recorded at [t].
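The shift described above can be illustrated with a small pandas sketch. The column names value and target and the horizon N = 3 are made up for illustration and are not from my real data; the point is only that shift(-N) moves the value observed at t+N back onto row t, which leaves the last N rows without a target.

```python
import numpy as np
import pandas as pd

N = 3  # forecast horizon (illustrative value)

df = pd.DataFrame({"value": np.arange(10, dtype=float)})

# Pull the value observed at t+N back onto row t, so that input[t] <=> target[t].
df["target"] = df["value"].shift(-N)

# The last N rows have no future value yet, so they cannot be used for training.
aligned = df.dropna()
```

After this step, row t holds both the inputs known at t and the value that will be observed N periods later.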
Depending on the environment I'm developing in, I will need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (based on system support). I need to know whether my implementation produces batches that do what I expect when running model.fit() in Keras. I'm unsure whether Keras internally shifts data during fitting in a way I'm unaware of, which might lead to poor results.
I'm potentially using an LSTM with the stateful argument, so I need to ensure my batches are a perfect fit, and I also want the batch sizes to be a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function to make this happen, given a few additional assumptions regarding validation/test sizes. On the surface everything looks good, but since I'm unsure of Keras' internals I don't know if I've made a blunder.
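The two batch constraints above (a power-of-2 batch size, and only full batches for a stateful LSTM) boil down to simple integer arithmetic. A minimal sketch follows; the helper names largest_power_of_two and samples_to_drop are hypothetical and are not the functions used in my implementation.

```python
def largest_power_of_two(n: int) -> int:
    """Largest power of two that is <= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n.bit_length() - 1)


def samples_to_drop(n_samples: int, batch_size: int) -> int:
    """How many leading samples to discard so the rest fills whole batches."""
    return n_samples % batch_size


batch_size = largest_power_of_two(25)      # 16
drop = samples_to_drop(101, batch_size)    # 101 % 16 == 5
```

Dropping from the start rather than the end keeps the most recent observations, which is also what the slicing in my implementation tries to do.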
My question is whether I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array / TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t], using inputs at time [t].
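To make the alignment question concrete, here is a NumPy-only emulation of how I understand the two utilities pair windows with targets: TimeseriesGenerator pairs the window data[i-length:i] with targets[i], while timeseries_dataset_from_array pairs the window data[i:i+sequence_length] with targets[i]. This is a sketch of the documented behaviour as I read it, not the Keras code itself, and the function names generator_style and dataset_style are made up; it is worth checking against the installed version.

```python
import numpy as np


def generator_style(data, targets, length):
    """Emulate TimeseriesGenerator: window data[i-length:i] pairs with targets[i]."""
    idx = range(length, len(data))
    x = np.stack([data[i - length:i] for i in idx])
    y = np.array([targets[i] for i in idx])
    return x, y


def dataset_style(data, targets, sequence_length):
    """Emulate timeseries_dataset_from_array: window data[i:i+L] pairs with targets[i]."""
    n = min(len(data) - sequence_length + 1, len(targets))
    x = np.stack([data[i:i + sequence_length] for i in range(n)])
    y = np.array([targets[i] for i in range(n)])
    return x, y


series = np.arange(10)
L = 4

# Naive use: neither convention pairs the target with the LAST step of its window.
xg, yg = generator_style(series, series, L)   # target is 1 step after the window
xd, yd = dataset_style(series, series, L)     # target is L-1 steps before the window's end

# Adjusted use: the target at t is paired with a window ending at t.
xg2, yg2 = generator_style(series[1:], series[:-1], L)  # move inputs one step earlier
xd2, yd2 = dataset_style(series, series[L - 1:], L)     # drop the first L-1 targets
```

Under these assumptions, the two adjustments correspond to the features.shift(-1) and the surplus offset used in the two branches of my implementation.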
import pandas as pd
import numpy as np

use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen

    use_ts_data = False


def gp2(size):
    return np.power(2, int(np.log2(size)))


def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower
            batch_size = np.power(2, np.log(batch_size) / np.log(2) - 1)
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}%, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find greatest batch size that is a power of 2, that fits the smallest dataset size, and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None
        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]
        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]
        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)
        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]
        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]
        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]

    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the batched data for inspection
    def results(dataset, index):
        print("\n-------------------\n")
        print(f"Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, batch in enumerate(dataset):
            inputs, targets = batch
            if i == 0:
                print(
                    f"First:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
            if i == last_i:
                print(
                    f"Last:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )


# inputs and targets are expected to be aligned (i.e., loss functions should subtract the predicted target@t from the actual target@t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x
(
    batch_size,
    train,
    validation,
    test,
    train_index,
    validation_index,
    test_index,
) = train_validate_test_split(
    df['inputs'], df['targets'], train_size_ratio=0.5, max_batch_size=2, memory=8
)
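Since the test data above uses the same series for inputs and targets, alignment can also be checked programmatically instead of by reading the printout: in every batch, the last time step of each window should equal its target. The helper below is a hypothetical sketch (is_aligned is not part of Keras) and relies on that identical-series assumption.

```python
import numpy as np


def is_aligned(batches):
    """True if, in every batch, the last time step of each window equals its target.

    Only meaningful when inputs and targets come from the same series.
    """
    for inputs, targets in batches:
        inputs = np.asarray(inputs)
        targets = np.asarray(targets)
        n = len(targets)
        last_step = inputs[:, -1].reshape(n, -1)
        if not np.array_equal(last_step, targets.reshape(n, -1)):
            return False
    return True


# Hand-built batches of shape (batch, window) for illustration.
good = [(np.array([[0, 1, 2], [1, 2, 3]]), np.array([2, 3]))]
bad = [(np.array([[0, 1, 2], [1, 2, 3]]), np.array([3, 4]))]
```

The same call should work on the train, validation and test objects returned above, assuming each batch converts cleanly with np.asarray.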
All loss/metric functions that rely on y_pred and y_true assume matching indices. There's nothing special that Keras does in the background.
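A quick way to see what matching indices means in practice: a loss such as mean squared error pairs y_true[i] with y_pred[i] and nothing else. The NumPy function below is a stand-in written for illustration, not Keras's own implementation, and the numbers are made up.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error, pairing y_true[i] with y_pred[i]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y_true.copy()
off_by_one = np.array([2.0, 3.0, 4.0, 5.0])  # the same forecasts, one step out of place

loss_perfect = mse(y_true, perfect)        # 0.0
loss_shifted = mse(y_true, off_by_one)     # 1.0
```

So any shift between inputs and targets has to be handled when the batches are built; the loss will not correct for it.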