
sklearn preprocessing with a rolling window?

I am analyzing time series with sklearn. For this, I have already implemented a walk-forward cross-validation split scheme. I am now trying to find an efficient way to apply preprocessors to the input data without introducing information leakage.

Obviously, if I simply apply, for instance, a StandardScaler to a time series training set X with t in [0,T], then the scaler would transform X like this:

Z = (X-mean(X))/std(X)

where the values Z_t' are computed based on all t in [0,T], which includes forward-looking information (t >= t'). To prevent this, one needs to successively apply the scaling method to each sample X_t based on the data in (for instance) a rolling lookback window of size n_roll:

Z_t = (X_t - mean(X[t-n_roll:t]))/std(X[t-n_roll:t]) for all t in [n_roll,T]
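
For the standardization case, this formula can be written directly with pandas rolling windows rather than refitting a scaler per step. A minimal sketch (the helper name rolling_standardize is my own; shift(1) makes the statistics strictly past, matching the X[t-n_roll:t] slice above):

import pandas as pd

def rolling_standardize(X: pd.DataFrame, n_roll: int) -> pd.DataFrame:
    # Rolling version of StandardScaler:
    # Z_t = (X_t - mean(X[t-n_roll:t])) / std(X[t-n_roll:t])
    past = X.shift(1)                      # exclude the current row from the stats
    mu = past.rolling(n_roll).mean()       # mean over the n_roll rows before t
    sd = past.rolling(n_roll).std(ddof=0)  # ddof=0 matches sklearn's StandardScaler
    return ((X - mu) / sd).iloc[n_roll:]   # first n_roll rows have no full window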

This cuts away a portion of the training set at the beginning, corresponding to the size of the sliding window. Also, depending on the size of the sliding window, this will be (much) more computationally intensive than applying the scaling just once.

I am looking for a general way to adapt the preprocessing scalers in sklearn so that they can be applied in this fashion to get the transformed feature matrix Z without forward-looking bias, while remaining compatible with Pipeline.

Is this possible, or should I just write a manual function that preprocesses the data like this before feeding it into the pipeline?

EDIT

Let me add a first example of code that I wrote to process the data before feeding it into the pipeline:

import pandas as pd


class WalkForwardTransformer:

    def __init__(self, transformer, n_roll, method='<t'):
        self.transformer = transformer
        self.method = method
        self.n_roll = n_roll

    def generate_walkforward_chunks(self, X):
        # Each chunk holds the n_roll past rows plus the current row,
        # i.e. rows [t - n_roll, t] for t in [n_roll, len(X) - 1].
        for i in range(self.n_roll, len(X)):
            yield X.iloc[i - self.n_roll:i + 1]

    def transform(self, X: pd.DataFrame, verbose=0):
        ix = X.index[self.n_roll:]
        Xgen = self.generate_walkforward_chunks(X)
        Z = []
        for i, Xi in enumerate(Xgen):
            if self.method == '<t':
                # Fit on the n_roll rows strictly before t.
                self.transformer.fit(Xi.iloc[:-1])
            elif self.method == '<=t':
                # Fit on the window including the current row at t.
                self.transformer.fit(Xi)
            else:
                raise NotImplementedError(self.method)
            Xil = Xi.iloc[[-1]]  # the current sample as a one-row frame
            Zil = self.transformer.transform(Xil)
            Z.append(Zil.tolist()[0])
            if verbose == 1:
                print('Progress: %0.2f%%' % ((i + 1) * 100. / (len(X) - self.n_roll)),
                      end='\r')
        return pd.DataFrame(Z, index=ix, columns=X.columns)

The class can be initialized with any transformer that implements .fit() and .transform() methods.
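
To make this usable inside a Pipeline, the class would additionally need the sklearn estimator interface. A minimal sketch, assuming the rolling refit should happen inside transform() (the wrapper name WalkForwardScaler is mine; fit() is a no-op here, since all fitting is per-window):

from sklearn.base import BaseEstimator, TransformerMixin


class WalkForwardScaler(BaseEstimator, TransformerMixin):
    # Pipeline-compatible wrapper that refits `transformer` per rolling window.

    def __init__(self, transformer, n_roll, method='<t'):
        self.transformer = transformer
        self.n_roll = n_roll
        self.method = method

    def fit(self, X, y=None):
        # Nothing to learn globally: every statistic is recomputed
        # per window in transform(), so no state leaks across samples.
        return self

    def transform(self, X):
        return WalkForwardTransformer(self.transformer, self.n_roll,
                                      self.method).transform(X)

One caveat: transform() drops the first n_roll rows, and an sklearn Pipeline will not trim y to match, so the target would need to be aligned accordingly outside the pipeline.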

Running the code on a feature matrix (1000 rows, 172 features):


from time import perf_counter
from sklearn.preprocessing import MinMaxScaler

wft = WalkForwardTransformer(MinMaxScaler(), method='<=t', n_roll=100)

t0 = perf_counter()
Z_train = wft.transform(X_train.iloc[:1000], verbose=1)
print('Time: %ds' % (perf_counter() - t0))

Z_train[Z_train.columns[0]].plot()

[Plot: first transformed feature, method='<=t']

and running with method='<t':

[Plot: first transformed feature, method='<t']

The time it takes to run (with MinMaxScaler()) is about 4 s per 1000 points, so this will be quite slow on large datasets, as explained above. However, it should be sufficient to preprocess the data in a walk-forward manner.
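
For scalers whose statistics are simple rolling aggregates, the per-step refitting loop can be avoided entirely. A sketch of a vectorized equivalent of the MinMaxScaler case with method='<=t', using pandas rolling min/max (my own shortcut, not part of sklearn):

import pandas as pd

def rolling_minmax(X: pd.DataFrame, n_roll: int) -> pd.DataFrame:
    # Equivalent to refitting MinMaxScaler on each window [t - n_roll, t]
    # (the '<=t' method): the statistics include the current row.
    lo = X.rolling(n_roll + 1).min()
    hi = X.rolling(n_roll + 1).max()
    # Note: a constant window gives 0/0 = NaN here, whereas sklearn's
    # MinMaxScaler special-cases a zero range.
    return ((X - lo) / (hi - lo)).iloc[n_roll:]

This computes the same Z in a single pass per feature instead of refitting the scaler at every time step.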

The difference between method '<=t' and '<t' is that for the first one we fit the scaler on data up to and including t and then apply it to the point at t, while for the second one we fit on data up to but not including t and then scale the point at t. Thus, in the case of the MinMaxScaler, the '<=t' values are clearly bounded in [0,1], while for '<t' this doesn't have to be the case.
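
A tiny numeric check of this (the values are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

past = np.array([[1.0], [3.0], [2.0]])   # window strictly before t
x_t = np.array([[5.0]])                  # current sample, outside past range

scaler = MinMaxScaler().fit(past)        # '<t': fit without x_t
print(scaler.transform(x_t))             # [[2.]] -- outside [0, 1]

scaler.fit(np.vstack([past, x_t]))       # '<=t': fit including x_t
print(scaler.transform(x_t))             # [[1.]] -- bounded in [0, 1]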

Check out my repo seglearn, it can do this for you: https://github.com/dmbee/seglearn
