ColumnTransformer in sklearn does not implement its own fit logic; it just calls fit_transform automatically?

Question

My data grows too large to handle in memory when I apply a sliding window algorithm, but the raw data is small enough that I can call ColumnTransformer's fit method without any issues. Thus, my desired workflow is:

Fit the MinMaxScaler() on the entire data
Transform batches of the data with the SlidingWindowAlgorithm()

The problem is that ColumnTransformer does not seem to have a standalone fit method. Looking at the source code shows this:

def fit(self, X, y=None):
    # we use fit_transform to make sure to set sparse_output_ (for which we
    # need the transformed data) to have consistent output type in predict
    self.fit_transform(X, y=y)
    return self

I don't understand their reasoning (I don't know the purpose of sparse_output_).
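For context, sparse_output_ is a fitted attribute of ColumnTransformer that records whether transform returns a sparse matrix or a dense array. Since it depends on the transformed outputs, fit has to run the transformation to set it. The following is a minimal sketch of that behaviour; the toy array and the transformer names ("scale", "onehot") are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data (assumed for illustration): 4 rows, 2 numeric columns.
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])

# Only dense transformers: the combined output is dense.
ct_dense = ColumnTransformer([("scale", MinMaxScaler(), [0, 1])])
ct_dense.fit(X)
dense_flag = ct_dense.sparse_output_

# OneHotEncoder returns a sparse matrix by default. With sparse_threshold=1.0
# the stacked result stays sparse whenever its density is below 1.
ct_sparse = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])], sparse_threshold=1.0
)
ct_sparse.fit(X)
sparse_flag = ct_sparse.sparse_output_

print(dense_flag, sparse_flag)
```

The flag can only be computed once the individual transformers have produced their outputs, which is consistent with the comment in the fit source above.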

Is there a way that I can fit my data without transforming? For what it's worth, I'm not using a sparse matrix, just a regular numpy array.

Here is my code. data_in is sized to take roughly 500MB of RAM, and as you increase the value of window_size you will see a huge spike in RAM usage. (I need window_size to be 60.)

## Part 0: Starting
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

data_in = np.random.rand(10*10**6,7) # This will take 500MB of RAM
window_size = 1 # Change this value!

## Part 1: Creating Transformers!
class SlidingWindowX(BaseEstimator, TransformerMixin):
    def __init__(self, window_size):
        self.window_size = window_size

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        '''
        Creates a sliding window over an input that has the shape of
        (rows, features) for X
        '''
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        row_size = X.shape[0]
        X_out = np.zeros((row_size-2*self.window_size, 1))

        for j in range(X.shape[1]):
            for i in range(self.window_size):
                idx1 = i
                idx2 = row_size-2*self.window_size+i
                X_out = np.concatenate((X_out, X[idx1:idx2, j].reshape(-1, 1)), axis=1)

        return X_out[:, 1:]

## Part 2: Making pipelines!
attribs_elec = np.arange(0, 7)


pipe_elec = Pipeline([
    ('min-max', MinMaxScaler()),
    ('window', SlidingWindowX(window_size))
])


pipe_full = ColumnTransformer([
    ("elec", pipe_elec, attribs_elec),
])

pipe_full.fit(data_in)

Answer 1

To answer your question

Is there a way that I can fit my data without transforming?

Yes, you can just call the fit_transform() method with your data and it will fit the ColumnTransformer. Your input data will not be modified: the input array remains unchanged and the transformed data is only returned as the output (which you don't have to keep).

Since your code cannot be run as you mentioned (lack of data), here is an example from the documentation:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])

Now you can call fit_transform():

ct.fit_transform(X)

which returns the transformed data:

array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

But the result is not stored anywhere, so the input data (X) remains the same:

Output:

array([[0., 1., 2., 2.],
       [1., 1., 0., 1.]])
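One caveat for the pipeline in the question: fit_transform() still computes the full windowed output before discarding it, so the memory peak remains. A possible workaround, sketched below under stated assumptions, is to fit the MinMaxScaler on its own and apply the window to one batch at a time. The sliding_window helper, the batch size, and the small random array are hypothetical stand-ins, the helper is a simplified version of SlidingWindowX, and windows here do not span batch boundaries.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def sliding_window(X, window_size):
    """Hypothetical helper: stack `window_size` shifted copies of each column."""
    n_out = X.shape[0] - window_size + 1
    cols = [
        X[i:i + n_out, j]
        for j in range(X.shape[1])
        for i in range(window_size)
    ]
    return np.column_stack(cols)


rng = np.random.default_rng(0)
data = rng.random((1000, 7))  # small stand-in for data_in
window_size = 3               # assumed value for the sketch
batch_size = 250              # assumed value for the sketch

# Fit the scaler alone: only per-column min/max are learned, and no
# windowed array is ever built for the full data set.
scaler = MinMaxScaler().fit(data)

batch_shapes = []
for start in range(0, data.shape[0], batch_size):
    batch = data[start:start + batch_size]
    windowed = sliding_window(scaler.transform(batch), window_size)
    batch_shapes.append(windowed.shape)
    # Consume `windowed` here (train on it, write it to disk, ...)
    # before the next iteration replaces it.

print(batch_shapes)
```

If windows must cross batch boundaries, each batch would need to overlap the previous one by window_size - 1 rows. If even the raw data were too large to fit at once, MinMaxScaler also provides partial_fit for incremental fitting.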
