Input Pipeline for LSTM with Timeseries Data Using a Large Dataset with Multiple .csv Files in TensorFlow

Currently I can train an LSTM network using one csv file, based on this tutorial: https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

This code generates sliding windows in which the last n_steps rows of the features are used to predict the target at the current time step (similar to this: Keras LSTM - feed sequence data with Tensorflow dataset API from the generator):

#%% Import
import pandas as pd
import tensorflow as tf
from numpy import array  # used by split_sequences below
from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

# for path 
import pathlib
import os

#%% Define functions
# Function to split multivariate input data into samples according to the number of timesteps (n_steps) used for the prediction ("sliding window")
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find end of this pattern
        end_ix = i + n_steps
        # check if beyond maximum index of input data
        if end_ix > len(sequences):
            break
        # gather input and output parts of the data in corresponding format (depending on n_steps)
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1, -1]
        X.append(seq_x)
        y.append(seq_y)
        #Append: Adds its argument as a single element to the end of a list. The length of the list increases by one.
    return array(X), array(y)

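# Set source files
# Assumption: dir_of_file is not defined in the original snippet; here it is taken to be
# the folder that contains 'SimulationData' (the current working directory)
dir_of_file = os.getcwd()
csv_train_path = os.path.join(dir_of_file, 'SimulationData', 'SimulationTrainData', 'SimulationTrainData001.csv')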

# Load data
df_train = pd.read_csv(csv_train_path, header=0, parse_dates=[0], index_col=0)


#%% Select features and target
features_targets_considered = ['Fz1', 'Fz2', 'Fz3', 'Fz4', 'Fz5', 'Fz_res']
n_features = len(features_targets_considered)-1 # subtract the target 

features_targets_train = df_train[features_targets_considered]

# "Convert" to array
train_values = features_targets_train.values

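# Number of previous timesteps used for each prediction
n_steps = 100

# Convert into input (n_steps x n_features, i.e. 100x5) and output (1) values
X, y = split_sequences(train_values, n_steps)
# test_values is not defined in this snippet; it is assumed to be built from a
# separate test csv in the same way as train_values
X_test, y_test = split_sequences(test_values, n_steps)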


#%% Define model
model = Sequential()
model.add(LSTM(200, activation='relu', return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(200, activation='relu', return_sequences=True))
model.add(LSTM(200, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

#%% Fit model
history = model.fit(X, y, epochs=200, verbose=1)
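
To make the window layout concrete, here is a minimal pure-Python sketch of the same indexing that split_sequences uses, run on a tiny made-up table. The name sliding_windows and the toy rows are illustrative assumptions, not part of the original code; the last column is treated as the target, as in the code above.

```python
def sliding_windows(rows, n_steps):
    """rows: list of rows, each row = [feature_1, ..., feature_k, target]."""
    X, y = [], []
    for end in range(n_steps, len(rows) + 1):
        window = rows[end - n_steps:end]
        # features: every column except the last, for all rows of the window
        X.append([row[:-1] for row in window])
        # target: last column of the last row in the window
        y.append(window[-1][-1])
    return X, y


# Toy data: 5 time steps, 2 feature columns, 1 target column
rows = [[t, 10 * t, 100 * t] for t in range(1, 6)]
X, y = sliding_windows(rows, n_steps=3)

print(len(X), len(X[0]), len(X[0][0]))  # 3 3 2
print(y)  # [300, 400, 500]
```

With the real data, n_steps = 100 and five feature columns would give X the shape (number of windows, 100, 5), which is what input_shape=(n_steps, n_features) expects.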

I now want to extend this example to train the network efficiently with many csv files. In the data folder I have the files 'SimulationTrainData001.csv', 'SimulationTrainData002.csv', ..., 'SimulationTrainData300.csv' (about 14 GB). To achieve this, I tried to adapt the code of this input pipeline example: https://www.tensorflow.org/guide/data#consuming_sets_of_files, which works to a certain extent. I can list the training files in the folder with this change:

# Set source folders
csv_train_path = os.path.join(dir_of_file, 'SimulationData', 'SimulationTrainData')
csv_train_path = pathlib.Path(csv_train_path)

#%% Show five example files from training folder
list_ds = tf.data.Dataset.list_files(str(csv_train_path/'*'))

for f in list_ds.take(5):
  print(f.numpy())

One problem is that in the example the files are pictures of flowers rather than time series values, and I do not know at which point I can apply the split_sequences(sequences, n_steps) function to create the sliding windows and provide the data format needed to train the LSTM network.
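
One possible place for the windowing step is directly after each file is read, so that windows never span two files and only one file is held in memory at a time. The following is a rough sketch of that idea using only the standard library. The helper names read_rows and windows_from_files, the shortened column list, and the throwaway demo files are all assumptions made for illustration; they are not taken from the TensorFlow guide.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical, shortened column subset; the last name is the target column
COLUMNS = ["Fz1", "Fz2", "Fz_res"]


def read_rows(csv_path, columns):
    """Read the selected columns of one csv file as lists of floats."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        return [[float(record[name]) for name in columns] for record in reader]


def windows_from_files(csv_paths, columns, n_steps):
    """Yield (features, target) pairs file by file; no window crosses a file boundary."""
    for csv_path in csv_paths:
        rows = read_rows(csv_path, columns)  # only this file is in memory
        for end in range(n_steps, len(rows) + 1):
            window = rows[end - n_steps:end]
            features = [row[:-1] for row in window]
            target = window[-1][-1]
            yield features, target


# Demo on two tiny throwaway files standing in for the real simulation csv files
with tempfile.TemporaryDirectory() as tmp:
    for file_no in (1, 2):
        path = Path(tmp) / f"demo{file_no:03d}.csv"
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["time"] + COLUMNS)
            for t in range(5):
                writer.writerow([t, file_no * 10 + t, file_no * 20 + t, file_no * 100 + t])
    paths = sorted(Path(tmp).glob("*.csv"))
    samples = list(windows_from_files(paths, COLUMNS, n_steps=3))

print(len(samples))  # 6 (3 windows from each of the 2 files)
```

Because n_steps is only an argument of the generator, changing it does not require regenerating any files. A generator like this could presumably be wrapped with tf.data.Dataset.from_generator (with a matching output_signature) to feed Keras, though that step is not shown or tested here.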

Also, as far as I know, training would benefit if the windows generated from the different files were shuffled. I could apply split_sequences(sequences, n_steps) to every csv file (to generate X_test, y_test), join the results in one big variable or file, and shuffle the windows, but I do not think this is efficient, and it would have to be redone whenever n_steps changes.
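
Shuffling does not necessarily require materializing every window first. A bounded shuffle buffer, which is the principle behind tf.data.Dataset.shuffle(buffer_size), mixes a stream of samples while keeping only buffer_size of them in memory. The sketch below is a plain-Python illustration of that principle; the function name shuffle_buffer and the integer stream standing in for windows are assumptions made for the example.

```python
import random


def shuffle_buffer(samples, buffer_size, seed=None):
    """Yield the items of `samples` in approximately random order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # fill the buffer before emitting anything
            buffer.append(sample)
            continue
        # emit a random buffered item and put the new sample in its slot
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = sample
    # input exhausted: emit whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer


stream = iter(range(20))  # stands in for a stream of (window, target) pairs
shuffled = list(shuffle_buffer(stream, buffer_size=8, seed=0))

print(sorted(shuffled) == list(range(20)))  # True: every sample appears exactly once
```

Consecutive windows from one file overlap heavily, so a small buffer only mixes neighbouring windows. A buffer larger than the number of windows per file, or reading several files in an interleaved fashion, would likely mix the files better.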

If somebody could suggest an (established) method or example to preprocess my data, I would be very thankful.

You can use the TimeseriesGenerator after consuming those sets of files.
Here is the reference link.

As per the documentation: "This class takes in a sequence of data-points gathered at equal intervals, along with time series parameters such as stride, length of history, etc., to produce batches for training/validation."
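
To illustrate how the length, sampling_rate and stride parameters select samples and targets, here is a small pure-Python sketch of the indexing, without batching or shuffling. It is an emulation written for this explanation, not the library's code, and the function name timeseries_windows is an assumption.

```python
def timeseries_windows(data, targets, length, sampling_rate=1, stride=1):
    """Sketch of the sample/target indexing: the window ends just before index i,
    and the target is taken at index i."""
    pairs = []
    for i in range(length, len(data), stride):
        window = data[i - length:i:sampling_rate]
        pairs.append((window, targets[i]))
    return pairs


series = list(range(1, 11))  # 1, 2, ..., 10
pairs = timeseries_windows(series, series, length=2)

print(len(pairs))  # 8
print(pairs[0])    # ([1, 2], 3)
print(pairs[-1])   # ([8, 9], 10)
```

Note that the target here is the value one step after the window, whereas split_sequences in the question takes the target from the last row of the window. To reproduce the question's alignment, the targets passed to the generator would need to be shifted by one step.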

Examples for both the univariate and the multivariate scenario are provided below.

Univariate example:


from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM 
import numpy as np
import tensorflow as tf

# define dataset
series = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# reshape to [10, 1]
n_features = 1
series = series.reshape((len(series), n_features))

# define generator
n_input = 2
generator = TimeseriesGenerator(series, series, length=n_input, batch_size=8)

# create model
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# fit model (model.fit accepts the generator directly; fit_generator is deprecated in TF 2.x)
model.fit(generator, steps_per_epoch=1, epochs=500, verbose=1)

#sample prediction
inputs = np.array([9, 10]).reshape((1, n_input, n_features))
result = model.predict(inputs, verbose=0)
print(result)

Multivariate example:

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM 
import numpy as np
import tensorflow as tf

# define dataset
in_seq1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
in_seq2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95, 105])
# reshape series
in_seq1 = in_seq1.reshape((len(in_seq1), 1))
in_seq2 = in_seq2.reshape((len(in_seq2), 1))
# horizontally stack columns
dataset = np.hstack((in_seq1, in_seq2))
# define generator
n_features = dataset.shape[1]
n_input = 2
generator = TimeseriesGenerator(dataset, dataset, length=n_input, batch_size=8)
# define model
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(n_input, n_features)))
model.add(Dense(2))
model.compile(optimizer='adam', loss='mse')
# fit model (model.fit accepts the generator directly; fit_generator is deprecated in TF 2.x)
model.fit(generator, steps_per_epoch=1, epochs=500, verbose=1)

# make a one step prediction out of sample
inputs = np.array([[90, 95], [100, 105]]).reshape((1, n_input, n_features))
result = model.predict(inputs, verbose=1)
print(result)

Note: All of these examples were run using Google Colaboratory.
