ColumnTransformer implementation in sklearn doesn't have a fit method defined, it just automatically calls fit_transform?

My data grows to a size too large to handle in memory when I'm applying a Sliding Window Algorithm, but it is small enough that I can call ColumnTransformer's fit method without any issues. Thus, my desired workflow is:

  1. Fit the MinMaxScaler() on the entire data
  2. Transform the data in batches with the SlidingWindowAlgorithm()

The problem is that ColumnTransformer does not seem to have a standalone fit method. The source code shows this:

def fit(self, X, y=None):
    # we use fit_transform to make sure to set sparse_output_ (for which we
    # need the transformed data) to have consistent output type in predict
    self.fit_transform(X, y=y)
    return self

I don't understand their reasoning (I don't know the purpose of sparse_output_).

Is there a way that I can fit my data without transforming? I'm not using a sparse matrix for what it's worth, just a regular numpy one.

Here is my code. data_in is sized to occupy roughly 500 MB of RAM, and as you increase window_size, memory usage spikes sharply. (I need window_size to be 60.)

## Part 0: Starting
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

data_in = np.random.rand(10*10**6,7) # This will take 500MB of RAM
window_size = 1 # Change this value!

## Part 1: Creating Transformers!
class SlidingWindowX(BaseEstimator, TransformerMixin):
    def __init__(self, window_size):
        self.window_size = window_size

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        '''
        Creates a sliding window over an input that has the shape of
        (rows, features) for X
        '''
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        row_size = X.shape[0]
        X_out = np.zeros((row_size-2*self.window_size, 1))

        for j in range(X.shape[1]):
            for i in range(self.window_size):
                idx1 = i
                idx2 = row_size-2*self.window_size+i
                X_out = np.concatenate((X_out, X[idx1:idx2, j].reshape(-1, 1)), axis=1)

        return X_out[:, 1:]

## Part 2: Making pipelines!
attribs_elec = np.arange(0, 7)


pipe_elec = Pipeline([
    ('min-max', MinMaxScaler()),
    ('window', SlidingWindowX(window_size))
])


pipe_full = ColumnTransformer([
    ("elec", pipe_elec, attribs_elec),
])

pipe_full.fit(data_in)
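
For scale, here is a rough estimate of why memory spikes. It is a back-of-the-envelope sketch that assumes float64 values and the output shape produced by SlidingWindowX above: (rows - 2*window_size) rows by (features * window_size) columns. It counts only the final array, not the temporary copies that the repeated np.concatenate calls create along the way.

```python
import numpy as np

n_rows, n_features, window_size = 10 * 10**6, 7, 60
bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes per float64

# Size of the input array data_in
input_bytes = n_rows * n_features * bytes_per_value

# Size of the array returned by SlidingWindowX.transform
out_rows = n_rows - 2 * window_size
out_cols = n_features * window_size
output_bytes = out_rows * out_cols * bytes_per_value

print(f"input:  {input_bytes / 1e9:.2f} GB")   # input:  0.56 GB
print(f"output: {output_bytes / 1e9:.2f} GB")  # output: 33.60 GB
```

Because fit() goes through fit_transform(), this full output is built even when only fitting is wanted.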

To answer your question

Is there a way that I can fit my data without transforming?

Yes, you can call the fit_transform() method with your data and it will fit the ColumnTransformer. Your input data is not modified: the transformed data exists only as the return value, which you do not have to keep.

Since I could not run your code as posted, here is an example from the documentation:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])

Now you can call fit_transform():

ct.fit_transform(X)

which returns the transformed data:

array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

The result is not stored anywhere unless you assign it, so the original data (X) remains unchanged:

X

Output:

array([[0., 1., 2., 2.],
       [1., 1., 0., 1.]])
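
If the transformed output itself is too large to hold in memory, one possible workaround is to bypass ColumnTransformer for this step: fit MinMaxScaler on its own (which is cheap), then scale and window the data in batches. The following is only a sketch under assumptions. iter_scaled_windows and batch_rows are hypothetical names, and the windowing uses NumPy's sliding_window_view (NumPy 1.20+) as a stand-in, so its output layout differs from the SlidingWindowX class in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.preprocessing import MinMaxScaler


def iter_scaled_windows(X, scaler, window_size, batch_rows):
    """Yield scaled, windowed chunks of X, one batch of windows at a time.

    Hypothetical helper: each yielded array has shape
    (windows_in_batch, n_features * window_size).
    """
    n_windows = X.shape[0] - window_size + 1
    for start in range(0, n_windows, batch_rows):
        stop = min(start + batch_rows, n_windows)
        # Windows start..stop-1 need input rows start..stop+window_size-2,
        # so consecutive batches overlap by window_size - 1 rows.
        chunk = scaler.transform(X[start:stop + window_size - 1])
        # View of shape (stop - start, n_features, window_size)
        windows = sliding_window_view(chunk, window_size, axis=0)
        # reshape copies the view into a dense 2-D array for this batch only
        yield windows.reshape(windows.shape[0], -1)


rng = np.random.default_rng(0)
data_in = rng.random((1000, 7))  # small stand-in for the real data

# Step 1: fit only. This stores per-feature min/max, no windowed output.
scaler = MinMaxScaler().fit(data_in)

# Step 2: transform in batches; process or save each batch, then discard it.
for batch in iter_scaled_windows(data_in, scaler, window_size=60, batch_rows=256):
    pass  # e.g. write the batch to disk or feed it to a model
```

Only one batch of windows is held in memory at a time, so batch_rows can be tuned to the available RAM.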
