
How to create a train/test split of time-series data by year?

I want to cross-validate my time-series data, splitting by the year of the timestamp.

Here is the data, in a pandas DataFrame:

mock_data

timestamp             counts
'2015-01-01 03:45:14' 4
     .
     .
     .
'2016-01-01 13:02:14' 12
     .
     .
     .
'2017-01-01 09:56:54' 6
     .
     .
     .
'2018-01-01 13:02:14' 8
     .
     .
     .
'2019-01-01 11:39:40' 24
     .
     .
     .
'2020-01-01 04:02:03' 30

mock_data.dtypes
timestamp object
counts    int64

Looking at scikit-learn's TimeSeriesSplit class, it does not appear that you can define the splits by year; n_splits only takes a number of splits. Is there another way to create successive training sets that produce the following train/test split? Something like this hypothetical newTimeSeriesSplit:

>>> tscv = newTimeSeriesSplit(n_splits=5, by='year')
>>> print(tscv)
newTimeSeriesSplit(max_train_size=None, n_splits=5, by='year')
>>> for train_index, test_index in tscv.split(mock_data):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [2015] TEST: [2016]
TRAIN: [2015 2016] TEST: [2017]
TRAIN: [2015 2016 2017] TEST: [2018]
TRAIN: [2015 2016 2017 2018] TEST: [2019]
TRAIN: [2015 2016 2017 2018 2019] TEST: [2020]

Thanks for viewing!

Updated Response

A generic approach for data with an arbitrary number of points in each year.

First, some data spanning several years, with a different number of points in each year, per the example. The setup is similar to the original answer's.

import numpy as np
import pandas as pd

ts_2015 = pd.date_range('2015-01-01', '2015-12-31', periods=4).to_series()
ts_2016 = pd.date_range('2016-01-01', '2016-12-31', periods=12).to_series()
ts_2017 = pd.date_range('2017-01-01', '2017-12-31', periods=6).to_series()
ts_2018 = pd.date_range('2018-01-01', '2018-12-31', periods=8).to_series()
ts_2019 = pd.date_range('2019-01-01', '2019-12-31', periods=24).to_series()
ts_2020 = pd.date_range('2020-01-01', '2020-12-31', periods=30).to_series()
ts_all = pd.concat([ts_2015, ts_2016, ts_2017, ts_2018, ts_2019, ts_2020])

df = pd.DataFrame({'X': np.random.randint(0, 100, size=ts_all.shape), 
                   'Y': np.random.randint(100, 200, size=ts_all.shape)},
                 index=ts_all)
df['year'] = df.index.year
df = df.reset_index()

Now we create a list of the unique years to iterate over and a dict to store the split dataframes.

year_list = df['year'].unique().tolist()
splits = {'train': [], 'test': []}

for idx, yr in enumerate(year_list[:-1]):
    train_yr = year_list[:idx+1]
    test_yr = [year_list[idx+1]]
    print('TRAIN: ', train_yr, 'TEST: ',test_yr)

    splits['train'].append(df.loc[df.year.isin(train_yr), :])
    splits['test'].append(df.loc[df.year.isin(test_yr), :])

Result:

TRAIN:  [2015] TEST:  [2016]
TRAIN:  [2015, 2016] TEST:  [2017]
TRAIN:  [2015, 2016, 2017] TEST:  [2018]
TRAIN:  [2015, 2016, 2017, 2018] TEST:  [2019]
TRAIN:  [2015, 2016, 2017, 2018, 2019] TEST:  [2020]

The split dataframes look something like the following:

>>> splits['train'][0]

                index   X    Y  year
0 2015-01-01 00:00:00  20  127  2015
1 2015-05-02 08:00:00  25  197  2015
2 2015-08-31 16:00:00  61  185  2015
3 2015-12-31 00:00:00  75  144  2015
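The question's timestamp column has dtype object (strings), so the year has to be parsed before any of the above applies. A minimal sketch, assuming the strings are in a format pd.to_datetime can parse; the small mock_data frame below is a hypothetical stand-in for the question's data, with one row per year:

```python
import pandas as pd

# Hypothetical stand-in for the question's mock_data: timestamps stored as strings.
mock_data = pd.DataFrame({
    "timestamp": ["2015-01-01 03:45:14", "2016-01-01 13:02:14",
                  "2017-01-01 09:56:54", "2018-01-01 13:02:14",
                  "2019-01-01 11:39:40", "2020-01-01 04:02:03"],
    "counts": [4, 12, 6, 8, 24, 30],
})

# Parse the strings into datetime64 values, then pull out the calendar year.
mock_data["timestamp"] = pd.to_datetime(mock_data["timestamp"])
mock_data["year"] = mock_data["timestamp"].dt.year

# Sort the years so the expanding window always moves forward in time,
# even if the rows themselves are not in chronological order.
year_list = sorted(mock_data["year"].unique().tolist())
```

From here, year_list can be fed to the same loop as above to build the train/test dataframes.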

Original Response

It was pointed out to me that this approach does not work in general, because it assumes that every year contains the same number of records.
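To see the problem concretely, here is a small sketch that runs TimeSeriesSplit on a year column with unequal record counts. The counts per year are borrowed from the question's example purely for illustration, and the variable names are my own. TimeSeriesSplit cuts the rows into equally sized blocks, so the fold boundaries need not line up with year boundaries:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Records per year (illustrative, borrowed from the question's example).
counts = {2015: 4, 2016: 12, 2017: 6, 2018: 8, 2019: 24, 2020: 30}
years = np.repeat(list(counts.keys()), list(counts.values()))

tscv = TimeSeriesSplit(n_splits=len(counts) - 1)

# For each fold, collect the distinct years present in the train and test rows.
fold_years = [
    (np.unique(years[train_idx]).tolist(), np.unique(years[test_idx]).tolist())
    for train_idx, test_idx in tscv.split(years)
]
for train_years, test_years in fold_years:
    print("TRAIN:", train_years, "TEST:", test_years)
```

With these counts, the first fold already has 2016 on both the training and the test side, which is exactly the leakage across year boundaries that the updated response avoids.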

Your intent is a little unclear, but I believe what you want to do is pass a dataframe with a timestamp index into a new version of the TimeSeriesSplit class that yields n_splits = n_years - 1 splits, based on the number of years in your data. The TimeSeriesSplit class gives you the flexibility to do this, but you need to extract the year from your timestamp index first. The result doesn't look quite like what you've proposed, but the outcome is, I believe, what you want.

First, some dummy data:

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

ts_index = pd.date_range('2015-01-01','2020-12-31',freq='M')
df = pd.DataFrame({'X': np.random.randint(0, 100, size=ts_index.shape), 
                   'Y': np.random.randint(100, 200, size=ts_index.shape)},
                 index=ts_index)

Now add a year column for the TimeSeriesSplit to work on. Because we have to index into the dataframe by row number and .ix is deprecated (it was removed entirely in pandas 1.0), I reset the index from timestamps to integers:

df['year'] = df.index.year
df = df.reset_index()

Then create a TimeSeriesSplit instance with the correct number of splits (n_years - 1):

tscv = TimeSeriesSplit(n_splits=len(df['year'].unique()) - 1)

Now we can generate the indices. Instead of printing the indices themselves, print the corresponding values of the year column, and only the unique years:

for train_idx, test_idx in tscv.split(df['year']):
    print('TRAIN: ', df.loc[df.index.isin(train_idx), 'year'].unique(), 
          'TEST: ', df.loc[df.index.isin(test_idx), 'year'].unique())

TRAIN:  [2015] TEST:  [2016]
TRAIN:  [2015 2016] TEST:  [2017]
TRAIN:  [2015 2016 2017] TEST:  [2018]
TRAIN:  [2015 2016 2017 2018] TEST:  [2019]
TRAIN:  [2015 2016 2017 2018 2019] TEST:  [2020]

You would of course access your training/test sets in a similar manner. If you really wanted to button this up nicely, you could extend the TimeSeriesSplit class and either customize the initialization or add some new methods.
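One way such a wrapper might look is sketched below. This is not part of scikit-learn: YearSeriesSplit is a hypothetical name, and instead of subclassing TimeSeriesSplit it is a plain class exposing split and get_n_splits, the two methods scikit-learn splitters provide. It assumes the year of each row is passed in through the groups argument, and it yields positional row indices, so it does not depend on every year having the same number of records:

```python
import numpy as np
import pandas as pd


class YearSeriesSplit:
    """Hypothetical expanding-window splitter: train on all earlier years, test on the next."""

    def _years(self, groups):
        if groups is None:
            raise ValueError("pass the year of each row as `groups`")
        return np.asarray(groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        # One split per year except the first, which can only be trained on.
        return len(np.unique(self._years(groups))) - 1

    def split(self, X, y=None, groups=None):
        years = self._years(groups)
        unique_years = np.unique(years)  # np.unique returns sorted values
        for k in range(1, len(unique_years)):
            train_mask = np.isin(years, unique_years[:k])
            test_mask = years == unique_years[k]
            # Yield positional indices, as scikit-learn splitters do.
            yield np.flatnonzero(train_mask), np.flatnonzero(test_mask)


# Usage on a tiny frame with a different number of rows in each year.
ts = pd.to_datetime(["2015-03-01", "2015-09-01", "2016-01-15",
                     "2016-06-01", "2016-11-30", "2017-02-02"])
df = pd.DataFrame({"X": range(len(ts)), "year": ts.year})

cv = YearSeriesSplit()
for train_idx, test_idx in cv.split(df, groups=df["year"]):
    print("TRAIN:", df["year"].iloc[train_idx].unique(),
          "TEST:", df["year"].iloc[test_idx].unique())
```

Because the indices are positional, the training and test rows are selected with .iloc (or plain NumPy indexing) rather than with label-based .loc.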

