Understanding multivariate time series classification with Keras

Question

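I am trying to understand how to correctly feed data into my Keras model to classify multivariate time series data into three classes using an LSTM neural network.

I have already looked at various resources, mainly these three excellent blog posts by Jason Brownlee (post1, post2, post3), other SO questions, and several papers. None of the information there fits my problem exactly, and I could not work out whether my data preprocessing and the way I feed the data into the model are correct, so I thought I might get some help by stating my exact conditions here.

What I am trying to do is classify multivariate time series data, which in its original form is structured as follows: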

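I have 200 samples.
One sample is one csv file.
A sample can have 1 to 50 features (i.e. the csv file has 1 to 50 columns).
Each feature has its value "tracked" over a fixed number of time steps, say 100 (i.e. each csv file has exactly 100 rows).
Each csv file has one of three classes ("good", "too small", "too big").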

So my current status is the following:

I have a numpy array "samples" with the following structure:

# array holding all samples
[
    # sample 1        
    [
        # feature 1 of sample 1 
        [ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
        # feature 2 of sample 1 
        [ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
        ... # up to 50 features
    ],
    # sample 2        
    [
        # feature 1 of sample 2 
        [ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
        # feature 2 of sample 2 
        [ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
        ...  # up to 50 features
    ],
    ... # up to sample no. 200
]

I also have a numpy array "labels" with the same length as the "samples" array (i.e. 200). The labels are encoded as follows:

"good" = 0
"too small" = 1
"too big" = 2

[0, 2, 2, 1, 0, 1, 2, 0, 0, 0, 1, 2, ... ] # up to label no. 200

This "labels" array is then encoded with Keras' to_categorical function:

to_categorical(labels, len(np.unique(labels)))
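To make the effect of this encoding concrete, here is a minimal numpy-only sketch of what a one-hot encoding of the labels looks like. It is an illustration rather than the Keras implementation; the short `labels` array and the name `one_hot` are made up for the example.

```python
import numpy as np

# Made-up labels for illustration: 0 = "good", 1 = "too small", 2 = "too big"
labels = np.array([0, 2, 2, 1, 0, 1])
nb_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array yields one row per sample.
one_hot = np.eye(nb_classes, dtype="float32")[labels]

print(one_hot.shape)  # (6, 3)
print(one_hot[1])     # [0. 0. 1.]
```

Each label becomes a row of length 3 with a single 1 in the column of its class, which is the form a 3-unit softmax output is compared against.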

My model definition currently looks like this:

max_nb_features = 50
nb_time_steps = 100

model = Sequential()
model.add(LSTM(5, input_shape=(max_nb_features, nb_time_steps)))
model.add(Dense(3, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

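The 5 units in the LSTM layer are just picked at random for now.
The Dense layer has 3 output neurons for my three classes.

Then I split the data into training and testing data: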

samples_train, samples_test, labels_train, labels_test = train_test_split(samples, labels, test_size=0.33)

This leaves 134 samples for training and 66 samples for testing.

The problem I am currently running into is that the following code does not work:

model.fit(samples_train, labels_train, epochs=1, batch_size=1)

The error is the following:

Traceback (most recent call last):
  File "lstm_test.py", line 152, in <module>
    model.fit(samples_train, labels_train, epochs=1, batch_size=1)
  File "C:\Program Files\Python36\lib\site-packages\keras\models.py", line 1002, in fit
    validation_steps=validation_steps)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1630, in fit
    batch_size=batch_size)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1476, in _standardize_user_data
    exception_prefix='input')
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 113, in _standardize_input_data
    'with shape ' + str(data_shape))

ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (134, 1)

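It seems to me that it does not work because my samples can have a variable number of features. If I use "fake" (generated) data where all parameters are the same, except that each sample has exactly the same number of features (50), the code works.

Now what I am trying to understand is:

Are my general assumptions about how to structure the data for the LSTM input correct? Are the parameters (batch_size, input_shape) correct/sensible?
Is a Keras LSTM model able to handle samples with a different number of features in general?
If yes, how do I have to adapt my code for it to work with a different number of features?
If no, would "zero-padding" (filling) the columns of samples with fewer than 50 features work? Are there other preferred ways of achieving my goal?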

Answer 1

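I believe the input data for Keras should have the shape:

(number_of_samples, nb_time_steps, max_nb_features)

so the input_shape argument of the first layer, which leaves out the sample axis, would be (nb_time_steps, max_nb_features).

In many examples nb_time_steps = 1, whereas here it would be 100.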
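As a rough illustration of that layout, the sketch below zero-pads samples with fewer than 50 features and stacks them into a single (samples, time steps, features) array. It uses numpy only; the helper name `pad_and_stack` and the random data are hypothetical, and zero-padding is just one possible way to handle the varying feature count, not necessarily the best one.

```python
import numpy as np

def pad_and_stack(samples, max_nb_features=50):
    """Zero-pad each (n_features, n_steps) sample to max_nb_features rows and
    return one array of shape (n_samples, n_steps, max_nb_features)."""
    padded = []
    for sample in samples:
        sample = np.asarray(sample, dtype="float32")
        n_features, n_steps = sample.shape
        out = np.zeros((max_nb_features, n_steps), dtype="float32")
        out[:n_features] = sample   # real features first, zeros after
        padded.append(out.T)        # transpose to (n_steps, max_nb_features)
    return np.stack(padded)

# Made-up data: three samples with 1, 7 and 50 features, 100 time steps each
rng = np.random.default_rng(0)
raw = [rng.random((n, 100)) for n in (1, 7, 50)]

x = pad_and_stack(raw)
print(x.shape)  # (3, 100, 50)
```

The resulting array is a regular 3D float array, unlike a ragged list of samples, which numpy cannot turn into a 3D array.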

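PS: I tried to solve a very similar problem for an internship position (but my results turned out to be wrong). You can have a look here: https://github.com/AbbasHub/Deep_Learning_LSTM/blob/master/2018-09-22_Multivariate_LSTM.ipynb (see if you can spot my mistake!)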

Answer 2

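An LSTM model needs a 3D input in the form [samples, time steps, features].

When defining the first layer of the LSTM model, we only need to specify the time steps and features. Even though this looks 2D, it is actually 3D, because the sample count, i.e. the batch size, is specified at the time of model fitting.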

features = x_train_d.shape[1]

So we first need to reshape our input into 3D format:

x_train_d = np.reshape(x_train_d, (x_train_d.shape[0], 1, x_train_d.shape[1]))
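To see what this reshape does, here is a small self-contained sketch on made-up data. The values and the name `x_3d` are illustrative only; the real `x_train_d` in this answer is assumed to be a 2D array of shape (samples, features).

```python
import numpy as np

# Made-up 2D data: 4 samples, 3 features each
x_train_d = np.arange(12, dtype="float32").reshape(4, 3)
features = x_train_d.shape[1]

# Insert a time axis of length 1 between samples and features
x_3d = np.reshape(x_train_d, (x_train_d.shape[0], 1, features))

print(x_train_d.shape)  # (4, 3)
print(x_3d.shape)       # (4, 1, 3)
```

The values are unchanged; each sample simply becomes a sequence with a single time step.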

This is the first LSTM layer:

model.add(LSTM(5,input_shape=(1, features),activation='relu'))

And the model fit specifies the batch size of 50 samples that the LSTM expects:

model.fit(x_train_d,y_train_d.values,batch_size=50,epochs=100)
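Applied to the shapes in the question (100 time steps, up to 50 features, 3 classes), the same pattern might look like the sketch below. It assumes TensorFlow 2.x with its bundled Keras (`from tensorflow import keras`), uses random placeholder data instead of the real csv files, and swaps in `categorical_crossentropy`, the usual loss for one-hot labels with more than two classes. Treat it as a starting point under those assumptions, not as a tuned model.

```python
import numpy as np
from tensorflow import keras

nb_time_steps, max_nb_features, nb_classes = 100, 50, 3

# Placeholder data: 12 samples already shaped (samples, time steps, features)
rng = np.random.default_rng(0)
x = rng.random((12, nb_time_steps, max_nb_features)).astype("float32")
y = keras.utils.to_categorical(
    rng.integers(0, nb_classes, size=12), num_classes=nb_classes
)

model = keras.Sequential([
    # Only (time steps, features) is declared; the sample axis is implicit
    keras.layers.Input(shape=(nb_time_steps, max_nb_features)),
    keras.layers.LSTM(5),
    keras.layers.Dense(nb_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

model.fit(x, y, epochs=1, batch_size=4, verbose=0)
probs = model.predict(x, verbose=0)
print(probs.shape)  # (12, 3)
```

Each row of `probs` holds the three class probabilities for one sample.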

For the question raised about variable-length input, there is a discussion at https://datascience.stackexchange.com/questions/26366/training-an-rnn-with-examples-of-different-lengths-in-keras

Solution 1
2 · Accepted · 2018-09-28 02:41:25

Solution 2
1 · 2020-05-13 07:28:55
