
How to prepare data for LSTM when using multiple time series of different lengths and multiple features?

I have a dataset from a number of users (nUsers). Each user is sampled randomly in time (non-constant nSamples for each user). Each sample has a number of features (nFeatures). For example:

nUsers = 3 ---> 3 users

nSamples = [32, 52, 21] ---> first user was sampled 32 times, second user was sampled 52 times, etc.

nFeatures = 10 ---> constant number of features for each sample.

I would like the LSTM to produce a current prediction based on the current features and on previous predictions of the same user. Can I do that in Keras using an LSTM layer? I have 2 problems:

  1. The data has a different time series for each user. How do I incorporate this?
  2. How do I deal with adding the previous predictions into the current time feature space in order to make a current prediction?

Thanks for your help!

It sounds like each user is a sequence, so users may be the "batch size" for your problem. So at first, nExamples = nUsers.

If I understood your problem correctly (predict the next element), you should define a maximum length of "looking back". Say you can predict the next element from looking at the 7 previous ones, for instance (and not looking at the entire sequence).

For that, you should separate your data like this:

example 1: x[0] = [s0, s1, s2, ..., s6] | y[0] = s7   
example 2: x[1] = [s1, s2, s3, ..., s7] | y[1] = s8

Where sn is a sample with 10 features. Usually, it doesn't matter if you mix users. Create these little segments for all users and put everything together.

This will result in arrays shaped like:

x.shape -> (BatchSize, 7, 10) -> (BatchSize, 7 step sequences, 10 features)   
y.shape -> (BatchSize, 10)
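
A minimal sketch of building these segments, assuming the raw data is a list of (nSamples, 10) NumPy arrays, one per user (the make_windows helper and the random toy data are illustrative, not from the question):

import numpy as np

def make_windows(user_samples, window=7):
    # split one user's (nSamples, nFeatures) array into sliding windows
    # of `window` steps, with the following sample as the target
    xs, ys = [], []
    for start in range(len(user_samples) - window):
        xs.append(user_samples[start:start + window])
        ys.append(user_samples[start + window])
    return xs, ys

# toy data: 3 users sampled 32, 52 and 21 times, 10 features per sample
users = [np.random.rand(n, 10) for n in (32, 52, 21)]

x_all, y_all = [], []
for u in users:                      # mixing users is fine here
    xs, ys = make_windows(u)
    x_all.extend(xs)
    y_all.extend(ys)

x = np.array(x_all)                  # (BatchSize, 7, 10)
y = np.array(y_all)                  # (BatchSize, 10)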

Maybe you don't mean predicting the next set of features, but just predicting something. In that case, just replace y with the value you want. That may result in y.shape -> (BatchSize,) if you want just a single result.
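
For that single-result case, a small Keras sketch might look like this (the layer sizes and the one-unit Dense output are assumptions for illustration, not something given in the question):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(32, input_shape=(7, 10)),   # 7-step windows, 10 features per step
    Dense(1)                         # one value per window -> y.shape (BatchSize,)
])
model.compile(optimizer='adam', loss='mse')
# model.fit(x, y, epochs=10)         # with x, y built as above and y holding your target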


Now, if you do need the entire sequence for predicting (instead of n previous elements), then you will have to define the maximum length and pad the sequences.

Suppose your longest sequence, as in your example, is 52. Then:

x.shape -> (Users, 52, 10).    

Then you will have to "pad" the sequences to fill the blanks. You can, for instance, fill the beginning of the sequences with zero features, such as:

x[0] = [s0, s1, s2, ......., s51] -> user with the longest sequence    
x[1] = [0 , 0 , s0, s1, ..., s49] -> user with a shorter sequence
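
A sketch of this pre-padding with Keras' pad_sequences utility, assuming the per-user data is again a list of (nSamples, 10) arrays:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

users = [np.random.rand(n, 10) for n in (32, 52, 21)]

# pad at the beginning ('pre') with zeros, up to the longest sequence (52)
x = pad_sequences(users, maxlen=52, dtype='float32', padding='pre', value=0.0)
print(x.shape)   # (3, 52, 10)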

Or (I'm not sure this works, I never tested it), pad the ending with zero values and use the Masking layer, which is what Keras has for "variable length sequences". You still use a fixed-size array, but internally it will (?) discard the zero values.
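
A sketch of that second option, padding at the end and letting a Masking layer skip the all-zero timesteps (layer sizes and the output layer are illustrative assumptions):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

users = [np.random.rand(n, 10) for n in (32, 52, 21)]

# pad at the end ('post') with zeros
x = pad_sequences(users, maxlen=52, dtype='float32', padding='post', value=0.0)

model = Sequential([
    Masking(mask_value=0.0, input_shape=(52, 10)),  # timesteps that are all zeros get skipped
    LSTM(32),
    Dense(10)                                       # e.g. one 10-feature prediction per user
])
model.compile(optimizer='adam', loss='mse')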
