
Sliding window train/test split for time series data

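I have a series of 36 data points and I want to do a sliding-window train/test split. I have looked at TimeSeriesSplit(), but it only produces expanding training windows like this: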

('TRAIN:', array([0, 1, 2]), 'TEST:', array([3, 4, 5]))
('TRAIN:', array([0, 1, 2, 3, 4, 5]), 'TEST:', array([6, 7, 8]))
('TRAIN:', array([0, 1, 2, 3, 4, 5, 6, 7, 8]), 'TEST:', array([ 9, 10, 11]))
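
For reference, output of this shape can be reproduced with a sketch like the one below. The 12-sample array and `n_splits=3` are assumptions chosen to match the printed indices; the post does not state which values were actually used.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumed toy data: 12 points, chosen so the folds match the output above.
X = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(train, test) for train, test in tscv.split(X)]

for train, test in folds:
    # The training window always starts at index 0 and grows with each fold.
    print("TRAIN:", train, "TEST:", test)
```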

What I want is a fixed-length training window of 12 points that slides forward 1 point at a time, together with a fixed-length test window of 3 points. For example:

('TRAIN:', array([0,1,2,3,4,5,6,7,8,9,10,11]), 
 'TEST:', array([12,13,14]))
('TRAIN:', array([1,2,3,4,5,6,7,8,9,10,11,12]), 
 'TEST:', array([13,14,15]))
('TRAIN:', array([2,3,4,5,6,7,8,9,10,11,12,13]), 
 'TEST:', array([14,15,16]))
...

I read this post ( https://ntguardian.wordpress.com/2017/06/19/walk-forward-analysis-demonstration-backtrader/ ) and tried:

from sklearn.model_selection import TimeSeriesSplit
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples
import numpy as np

class TimeSeriesSplitImproved(TimeSeriesSplit):
    def split(self, X, y=None, groups=None, fixed_length=False,
              train_splits=1, test_splits=1):
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        train_splits, test_splits = int(train_splits), int(test_splits)
        if n_folds > n_samples:
            raise ValueError(
                ("Cannot have number of folds ={0} greater"
                 " than the number of samples: {1}.").format(n_folds,
                                                             n_samples))
        if (n_folds - train_splits - test_splits) <= 0 and test_splits > 0:
            raise ValueError(
                ("Both train_splits and test_splits must be positive"
                 " integers."))
        indices = np.arange(n_samples)
        split_size = (n_samples // n_folds)
        test_size = split_size * test_splits
        train_size = split_size * train_splits
        test_starts = range(train_size + n_samples % n_folds,
                            n_samples - (test_size - split_size),
                            split_size)
        if fixed_length:
            for i, test_start in zip(range(len(test_starts)),
                                     test_starts):
                rem = 0
                if i == 0:
                    rem = n_samples % n_folds
                yield (indices[(test_start - train_size - rem):test_start],indices[test_start:test_start + test_size])
        else:
            for test_start in test_starts:
                yield (indices[:test_start],indices[test_start:test_start + test_size])


model = TimeSeriesSplitImproved(n_splits=5)
for train_index, test_index in model.split(X,fixed_length=True,train_splits=2, test_splits=1):
    print("TRAIN:", train_index, "TEST:", test_index)
    train, test = X[train_index], X[test_index]

But I only got this:

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14 15 16 17]
TRAIN: [ 6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18 19 20 21 22 23]
TRAIN: [12 13 14 15 16 17 18 19 20 21 22 23] TEST: [24 25 26 27 28 29]
TRAIN: [18 19 20 21 22 23 24 25 26 27 28 29] TEST: [30 31 32 33 34 35]

Thanks in advance for your help!
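
The step of 6 in that output follows from how `TimeSeriesSplitImproved.split` derives its sizes: the train size, test size and stride are all multiples of `n_samples // (n_splits + 1)`. The sketch below only mirrors that arithmetic; `window_sizes` is a hypothetical helper written for illustration and is not part of the class or of scikit-learn.

```python
def window_sizes(n_samples, n_splits, train_splits, test_splits):
    """Mirror the size arithmetic of TimeSeriesSplitImproved.split (hypothetical helper)."""
    n_folds = n_splits + 1
    split_size = n_samples // n_folds          # size of one block, also the stride
    train_size = split_size * train_splits
    test_size = split_size * test_splits
    test_starts = list(range(train_size + n_samples % n_folds,
                             n_samples - (test_size - split_size),
                             split_size))
    return split_size, train_size, test_size, test_starts

# Settings from the question: blocks of 6 points, so windows move 6 at a time.
print(window_sizes(36, n_splits=5, train_splits=2, test_splits=1))

# Blocks of 1 point: 12-point train window, 3-point test window, stride 1.
split_size, train_size, test_size, starts = window_sizes(
    36, n_splits=35, train_splits=12, test_splits=3)
print(split_size, train_size, test_size, starts[:3], starts[-1])
```

By this arithmetic, calling the class with `n_splits=35`, `train_splits=12`, `test_splits=3` and `fixed_length=True` should give the 12/3 windows with a stride of 1 on a 36-point series, although the parameter names then no longer describe what they control very well.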

Given that your dataset has only 36 points, you can do this manually fairly easily. The following example should help:

import numpy as np

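data = list(range(36))
window_size = 12  # length of each training window
test_size = 3     # length of each test window
splits = []

# Stop early enough that every test window still holds test_size points.
for i in range(window_size, len(data) - test_size + 1):
    train = np.array(data[i - window_size:i])
    test = np.array(data[i:i + test_size])
    splits.append(('TRAIN:', train, 'TEST:', test))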

# View result
for a_tuple in splits:
    print(a_tuple)

# ('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]), 'TEST:', array([12, 13, 14]))
# ('TRAIN:', array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]), 'TEST:', array([13, 14, 15]))
# ('TRAIN:', array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]), 'TEST:', array([14, 15, 16]))
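
The same loop can be wrapped in a reusable generator that yields index arrays instead of slices of the data. This is a minimal sketch, not an existing scikit-learn API: the function name `sliding_window_splits` and its `step` parameter are hypothetical, and it stops as soon as a full test window no longer fits.

```python
import numpy as np

def sliding_window_splits(n_samples, train_size, test_size, step=1):
    """Yield (train_idx, test_idx) pairs for fixed-length sliding windows."""
    if train_size <= 0 or test_size <= 0 or step <= 0:
        raise ValueError("train_size, test_size and step must be positive")
    indices = np.arange(n_samples)
    # The last valid test window ends exactly at n_samples.
    last_test_start = n_samples - test_size
    for test_start in range(train_size, last_test_start + 1, step):
        train_idx = indices[test_start - train_size:test_start]
        test_idx = indices[test_start:test_start + test_size]
        yield train_idx, test_idx

# 36 points, 12-point training window, 3-point test window, stride 1.
index_splits = list(sliding_window_splits(36, train_size=12, test_size=3))
print(len(index_splits))
for train_idx, test_idx in index_splits[:2]:
    print("TRAIN:", train_idx, "TEST:", test_idx)
print("TRAIN:", index_splits[-1][0], "TEST:", index_splits[-1][1])
```

Because the generator yields index arrays, the pairs can be used to slice a NumPy array directly (`X[train_idx]`, `X[test_idx]`). scikit-learn's `cv` argument also accepts an iterable of (train, test) index pairs, so the list can be passed there as well.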

