
Question about Train-Test Split in Time Series

2020-05-31 14:34:35 · python

I have a question about splitting the data into a training and a test set in time series tasks. I know that the data shouldn't be shuffled, because it's important to keep the temporal order of the data so that we don't create a scenario where the model can look into the future. However, when I shuffle the data (as an experiment), I get a ridiculously high R-squared score. And yes, the R-squared is evaluated on the test set. Can someone explain in simple terms why this is the case? Why does shuffling the train and test data in a time series produce a high R-squared score? My guess is that it has something to do with the trend of the time series, but I am not sure. I am just asking out of curiosity, thanks!

1 Answer

It really depends on your problem:

If your model has no memory and the task is merely a mapping from inputs to outputs, the attached timestamp has no significance, and shuffling the data is in fact recommended because it gives a better distribution across the splits. If this is the case and you are getting a higher R-squared value, you should definitely go for it. (I assume this is the case, since R-squared is usually used for these types of tasks.)
If your task is pattern dependent and each prediction affects the next one in the sequence, order matters, and you should never shuffle the data. Any metric that suggests shuffling helps is lying. The best you can do is split the train and test sets on a timestamp: everything before it is the train set and everything after it is the test set. Then divide the train and test sets into fixed time windows. You can shuffle those windows, but only if the window span is large enough for your case.
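To see why a shuffled split can report a very high R-squared on ordered data, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration and not taken from the question: the series is a synthetic random walk, the "model" is a simple nearest-neighbour-in-time lookup, and the names `predict_nearest`, `evaluate`, and `make_windows` are hypothetical helpers. With a shuffled split, almost every test point has a training point one or two time steps away, so the model only has to interpolate. With a chronological split, it has to extrapolate past the last training point.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict_nearest(train, t):
    """Predict the value at time t from the training point closest in time."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - t))
    return nearest_y


def evaluate(train, test):
    """R-squared of the nearest-in-time predictor on the test points."""
    y_true = [y for _, y in test]
    y_pred = [predict_nearest(train, t) for t, _ in test]
    return r2_score(y_true, y_pred)


def make_windows(points, size):
    """Cut an ordered list into consecutive, non-overlapping windows."""
    return [points[i:i + size] for i in range(0, len(points) - size + 1, size)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic random walk (an assumption): each value builds on the previous one.
series, level = [], 0.0
for t in range(500):
    level += rng.gauss(0.0, 1.0)
    series.append((t, level))

cutoff = 400

# Chronological split: train strictly before the cutoff, test after it.
train_chrono, test_chrono = series[:cutoff], series[cutoff:]

# Shuffled split: test points end up scattered between training points.
shuffled = series[:]
rng.shuffle(shuffled)
train_shuf, test_shuf = shuffled[:cutoff], shuffled[cutoff:]

r2_chrono = evaluate(train_chrono, test_chrono)
r2_shuffled = evaluate(train_shuf, test_shuf)
print(f"chronological split R^2: {r2_chrono:.3f}")
print(f"shuffled split R^2:      {r2_shuffled:.3f}")

# Fixed time windows: order is kept inside each window, and only the
# windows themselves are shuffled (window size 50 is an arbitrary choice).
train_windows = make_windows(train_chrono, 50)
rng.shuffle(train_windows)
```

In the chronological case this particular predictor can only repeat the last training value, so its R-squared cannot exceed zero, while the shuffled score comes out far higher even though the model has learned nothing about the future. A real model will behave differently in the details, but the leakage mechanism is the same.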