
Question about Train-Test Split in Time Series

I have a question about splitting data into training and test sets in time series tasks. I know the data shouldn't be shuffled, because it's important to keep the temporal order of the data so that we do not create a scenario where the model can look into the future. However, when I shuffle the data (for experimenting), I get a ridiculously high R-squared score. And yes, the R-squared is evaluated on the test set. Can someone explain in simple terms why this is the case? Why does shuffling train and test data in a time series produce a high R-squared score? My guess is that it has something to do with the trend of the time series, but I am not sure. I am just asking out of curiosity, thanks!

It really depends on your problem:

  1. If your model has no memory and the task is merely a mapping from inputs to outputs, the attached timestamp has no significance. In that case shuffling is fine, and in fact recommended, because it gives a better distribution of samples across the split. If this is your situation and you are getting a higher R-squared value, you should definitely go for it. (I assume this is the case, since R-squared is usually used for these types of tasks.)
  2. If your task is pattern dependent and each prediction affects the next one in the sequence, order matters. In this case you should never shuffle the data, and any metric that suggests otherwise is lying. The best you can do is split on a timestamp: everything before it is your training set and everything after it is your test set. Then divide the training and test sets into fixed time windows. You can shuffle those windows, but only if the window span is large enough for your case.
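As a rough illustration of the second case, here is a minimal sketch comparing a shuffled split with a timestamp-based split. Everything in it is an assumption made for the demo rather than something from the question: the synthetic series (a linear trend plus a random walk), the 80/20 split, and the hypothetical helper `nearest_in_time_predict`, a deliberately naive predictor that copies the target of the closest training timestamp. With a shuffled split, almost every test point has an immediate neighbour in the training set, so even this naive predictor scores well. With a chronological split, the closest training point is always the last one, so the prediction is a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series (assumed for illustration): linear trend + random walk.
n = 1000
t = np.arange(n)
y = 0.05 * t + np.cumsum(rng.normal(size=n))


def r2_score(y_true, y_pred):
    """Coefficient of determination, computed on the given (test) targets."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def nearest_in_time_predict(t_train, y_train, t_test):
    """Hypothetical naive model: copy the target of the closest training timestamp."""
    idx = np.abs(t_test[:, None] - t_train[None, :]).argmin(axis=1)
    return y_train[idx]


n_train = int(0.8 * n)

# Shuffled split: test timestamps are scattered between training timestamps.
perm = rng.permutation(n)
train_idx, test_idx = perm[:n_train], perm[n_train:]
r2_shuffled = r2_score(
    y[test_idx],
    nearest_in_time_predict(t[train_idx], y[train_idx], t[test_idx]),
)

# Chronological split: train on everything before the cutoff, test on the rest.
train_idx, test_idx = t[:n_train], t[n_train:]
r2_chrono = r2_score(
    y[test_idx],
    nearest_in_time_predict(t[train_idx], y[train_idx], t[test_idx]),
)

print(f"R^2 with shuffled split:      {r2_shuffled:.3f}")
print(f"R^2 with chronological split: {r2_chrono:.3f}")
```

The shuffled score is high only because neighbouring points in a trending or slowly varying series are nearly identical, so the test set effectively leaks into the training set. The chronological score is the one that reflects how the model would behave on genuinely unseen future data.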

Hope this helps!


 