Understanding multivariate time series classification with Keras

Question

I am trying to understand how to correctly feed data into my Keras model to classify multivariate time series data into three classes using an LSTM neural network.

I have already looked at various resources, mainly three excellent blog posts by Jason Brownlee (post1, post2, post3), other SO questions, and different papers, but none of the information given there exactly fits my problem. I was not able to figure out whether my data preprocessing and the way I feed the data into the model are correct, so I hope I might get some help by specifying my exact conditions here.

What I am trying to do is classify multivariate time series data, which in its original form is structured as follows:

I have 200 samples.
One sample is one csv file.
A sample can have 1 to 50 features (i.e. the csv file has 1 to 50 columns).
Each feature has its value "tracked" over a fixed number of time steps, let's say 100 (i.e. each csv file has exactly 100 rows).
Each csv file has one of three classes ("good", "too small", "too big").

My current status is the following:

I have a numpy array "samples" with the following structure:

# array holding all samples
[
    # sample 1        
    [
        # feature 1 of sample 1 
        [ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
        # feature 2 of sample 1 
        [ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
        ... # up to 50 features
    ],
    # sample 2        
    [
        # feature 1 of sample 2 
        [ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
        # feature 2 of sample 2 
        [ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
        ...  # up to 50 features
    ],
    ... # up to sample no. 200
]

I also have a numpy array "labels" with the same length as the "samples" array (i.e. 200). The labels are encoded in the following way:

"good" = 0
"too small" = 1
"too big" = 2

[0, 2, 2, 1, 0, 1, 2, 0, 0, 0, 1, 2, ... ] # up to label no. 200

This "labels" array is then encoded with Keras' to_categorical function:

to_categorical(labels, len(np.unique(labels)))
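
As an illustration of what this encoding produces, here is a minimal numpy-only sketch. It does not call Keras; the short `labels` array is a made-up stand-in for the real 200 labels, and indexing an identity matrix is just one way to get the same kind of one-hot rows.

```python
import numpy as np

# Integer labels as described above: 0 = "good", 1 = "too small", 2 = "too big".
# This short array is a hypothetical stand-in for the real 200 labels.
labels = np.array([0, 2, 2, 1, 0, 1])

# Picking one row of the identity matrix per label yields a one-hot vector.
num_classes = len(np.unique(labels))
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)  # (6, 3)
print(one_hot[1])     # [0. 0. 1.]
```

Each row has a single 1 in the column of its class, which is the target format a 3-unit softmax output is usually trained against.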

My model definition currently looks like this:

max_nb_features = 50
nb_time_steps = 100

model = Sequential()
model.add(LSTM(5, input_shape=(max_nb_features, nb_time_steps)))
model.add(Dense(3, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The 5 units in the LSTM layer are just randomly picked for now.
3 output neurons in the dense layer for my three classes.

I then split the data into training / testing data:

samples_train, samples_test, labels_train, labels_test = train_test_split(samples, labels, test_size=0.33)

This leaves us with 134 samples for training and 66 samples for testing.

The problem I'm currently running into is that the following code is not working:

model.fit(samples_train, labels_train, epochs=1, batch_size=1)

The error is the following:

Traceback (most recent call last):
  File "lstm_test.py", line 152, in <module>
    model.fit(samples_train, labels_train, epochs=1, batch_size=1)
  File "C:\Program Files\Python36\lib\site-packages\keras\models.py", line 1002, in fit
    validation_steps=validation_steps)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1630, in fit
    batch_size=batch_size)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1476, in _standardize_user_data
    exception_prefix='input')
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 113, in _standardize_input_data
    'with shape ' + str(data_shape))

ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (134, 1)

To me, it seems not to work because of the variable number of features my samples can have. If I use "fake" (generated) data, where all parameters are the same except that each sample has exactly the same number of features (50), the code works.

Now what I'm trying to understand is:

Are my general assumptions on how I structured my data for the LSTM input correct? Are the parameters (batch_size, input_shape) correct / sensible?
Is the Keras LSTM model in general able to handle samples with different numbers of features?
If yes, how do I have to adapt my code for it to work with different numbers of features?
If no, would "zero-padding" (filling) the columns in the samples with fewer than 50 features work? Are there other, preferred methods of achieving my goal?
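
The zero-padding idea from the last question could be sketched as follows. This is only an illustration under stated assumptions: `pad_features` is a hypothetical helper, the random arrays stand in for the csv files, and nothing here says whether a model trained on padded features will perform well. It only shows how ragged samples can be brought to one fixed shape.

```python
import numpy as np

MAX_NB_FEATURES = 50   # from the question
NB_TIME_STEPS = 100    # from the question


def pad_features(sample, max_features=MAX_NB_FEATURES):
    """Zero-pad a (n_features, n_time_steps) array along the feature axis."""
    sample = np.asarray(sample, dtype="float32")
    missing = max_features - sample.shape[0]
    if missing < 0:
        raise ValueError("sample has more features than max_features")
    # Pad only at the end of axis 0 (features); leave the time axis untouched.
    return np.pad(sample, ((0, missing), (0, 0)), mode="constant")


# Hypothetical stand-ins for three csv files with 1, 7 and 50 features.
rng = np.random.default_rng(0)
raw_samples = [rng.normal(size=(n, NB_TIME_STEPS)) for n in (1, 7, 50)]

# After padding, every sample has the same shape and can be stacked.
samples = np.stack([pad_features(s) for s in raw_samples])
print(samples.shape)  # (3, 50, 100)
```

Once every sample has the same shape, numpy can build a regular 3D array instead of an array of differently sized objects, which is the kind of input the error message above is asking for.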

Answer 1

I believe the input data for Keras should have the shape:

(number_of_samples, nb_time_steps, max_nb_features)

The input_shape argument of the first layer then covers only the last two dimensions, i.e. input_shape=(nb_time_steps, max_nb_features).

And most often nb_time_steps = 1

PS: I tried solving a very similar problem for an internship position (but my results turned out to be wrong). You may take a look here: https://github.com/AbbasHub/Deep_Learning_LSTM/blob/master/2018-09-22_Multivariate_LSTM.ipynb (see if you can spot my mistake!)

Answer 2

The LSTM model requires a 3D input in the form of [samples, time steps, features].

When defining the first layer of our LSTM model, we need to specify only the time steps and features. Even though this may seem 2D, it is actually 3D, because the number of samples per batch, i.e. the batch size, is specified at model fit time.

features = x_train_d.shape[1]

Hence, we first need to reshape our input into the 3D format:

x_train_d = np.reshape(x_train_d, (x_train_d.shape[0], 1, x_train_d.shape[1]))
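
The reshape above fits 2D data with a single time step per sample. The question's arrays are instead stored as (samples, features, time steps), so there a transpose of the last two axes is probably the relevant step. A minimal sketch, assuming all samples already have 50 features (the zero-filled array is a hypothetical stand-in for the real data):

```python
import numpy as np

# Hypothetical batch in the question's layout: (samples, features, time steps).
features_first = np.zeros((200, 50, 100), dtype="float32")
features_first[0, 3, 10] = 1.0  # sample 0, feature 3, time step 10

# Swap the last two axes so each sample becomes (time steps, features).
time_major = np.transpose(features_first, (0, 2, 1))

print(time_major.shape)      # (200, 100, 50)
print(time_major[0, 10, 3])  # 1.0
```

A transpose is used rather than np.reshape because reshape keeps the elements in their original memory order and would mix values from different features into the same time step.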

Here goes the LSTM first layer:

model.add(LSTM(5,input_shape=(1, features),activation='relu'))

And the model fit specifies a batch size of 50 samples, as expected by the LSTM:

model.fit(x_train_d,y_train_d.values,batch_size=50,epochs=100)
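
For the data described in the question (100 time steps, up to 50 features, three classes), the same pattern might look like the sketch below. It assumes a TensorFlow installation that provides `tensorflow.keras`, uses random stand-in data already shaped as (samples, time steps, features), and swaps in `categorical_crossentropy`, the loss usually paired with a softmax output and one-hot labels. It is not the original poster's code.

```python
import numpy as np
from tensorflow import keras

nb_time_steps = 100
max_nb_features = 50
nb_classes = 3

# Random stand-in data, already shaped (samples, time steps, features).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, nb_time_steps, max_nb_features)).astype("float32")
y = keras.utils.to_categorical(rng.integers(0, nb_classes, size=8), nb_classes)

model = keras.Sequential([
    # The input shape leaves out the sample axis: (time steps, features).
    keras.Input(shape=(nb_time_steps, max_nb_features)),
    keras.layers.LSTM(5),
    keras.layers.Dense(nb_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# The batch size is given here, at fit time, not in the model definition.
model.fit(x, y, epochs=1, batch_size=4, verbose=0)

probs = model.predict(x, verbose=0)
print(probs.shape)  # (8, 3)
```

Each row of `probs` holds one probability per class; taking the argmax over the last axis gives the predicted label.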

For the question about variable-length inputs, there is a discussion at https://datascience.stackexchange.com/questions/26366/training-an-rnn-with-examples-of-different-lengths-in-keras

Answer 1: 2, accepted, 2018-09-28 02:41:25
Answer 2: 1, 2020-05-13 07:28:55
