
ColumnTransformer implementation in sklearn doesn't have a fit method defined, it just automatically calls fit_transform?

My data grows too large to hold in memory when I apply a sliding-window algorithm, but it is small enough that I can call ColumnTransformer's fit method without any issues. Thus, my desired workflow is:

  1. Fit the MinMaxScaler() on the entire data
  2. Transform the data in batches for the SlidingWindowAlgorithm()

The problem is that ColumnTransformer does not seem to have a real fit method of its own. Looking at the source code shows this:

def fit(self, X, y=None):
    # we use fit_transform to make sure to set sparse_output_ (for which we
    # need the transformed data) to have consistent output type in predict
    self.fit_transform(X, y=y)
    return self

I don't understand their reasoning (I don't know the purpose of sparse_output_).

Is there a way that I can fit my data without transforming it? For what it's worth, I'm not using a sparse matrix, just a regular numpy array.

Here is my code. data_in is sized to take about 500 MB of RAM, and as you increase the value of window_size you will see a huge spike in RAM usage. (I need window_size to be 60.)

## Part 0: Starting
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

data_in = np.random.rand(10*10**6,7) # This will take 500MB of RAM
window_size = 1 # Change this value!

## Part 1: Creating Transformers!
class SlidingWindowX(BaseEstimator, TransformerMixin):
    def __init__(self, window_size):
        self.window_size = window_size

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        '''
        Creates a sliding window over an input that has the shape of
        (rows, features) for X
        '''
        if X.ndim == 1:
            X = X.reshape(-1, 1)

        row_size = X.shape[0]
        X_out = np.zeros((row_size-2*self.window_size, 1))

        for j in range(X.shape[1]):
            for i in range(self.window_size):
                idx1 = i
                idx2 = row_size-2*self.window_size+i
                X_out = np.concatenate((X_out, X[idx1:idx2, j].reshape(-1, 1)), axis=1)

        return X_out[:, 1:]

## Part 2: Making pipelines!
attribs_elec = np.arange(0, 7)


pipe_elec = Pipeline([
    ('min-max', MinMaxScaler()),
    ('window', SlidingWindowX(window_size))
])


pipe_full = ColumnTransformer([
    ("elec", pipe_elec, attribs_elec),
])

pipe_full.fit(data_in)

To answer your question:

Is there a way that I can fit my data without transforming?

Yes, you can just call the fit_transform() method with your data and it will fit the ColumnTransformer. Your input data is not modified: the transformed data is only returned as the output, which you don't have to save. Note that the transformed array is still computed during the call, so discarding it does not by itself lower peak memory use.
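As a side note on the sparse_output_ attribute mentioned in the question: it is a boolean that ColumnTransformer sets while fitting, recording whether transform() will return a SciPy sparse matrix or a dense NumPy array. The following is a minimal sketch, assuming only dense transformers; the names `X_demo` and `ct_demo` are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative input (hypothetical data, not from the question).
X_demo = np.array([[0., 10.], [5., 20.], [10., 30.]])
X_before = X_demo.copy()

ct_demo = ColumnTransformer([("scale", MinMaxScaler(), [0, 1])])

# fit() runs fit_transform() internally and discards the result;
# it returns the fitted transformer itself.
fitted = ct_demo.fit(X_demo)

same_object = fitted is ct_demo                 # fit() returns self
is_sparse = ct_demo.sparse_output_              # dense transformers give a dense output
unchanged = np.array_equal(X_demo, X_before)    # the input array is left as it was
X_scaled = ct_demo.transform(X_demo)            # transform can be called separately later
```

So fit() and fit_transform() leave the transformer in the same fitted state; the only difference is whether you keep the returned array.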

Since your code cannot be run as posted (lack of data), here is an example from the documentation:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])

Now you can call fit_transform():

ct.fit_transform(X)

which returns the transformed data:

array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

But this result is not stored anywhere, and therefore the data (X) remains the same:

X

Output:

array([[0., 1., 2., 2.],
       [1., 1., 0., 1.]])
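If the transformed output itself is what does not fit in memory, one way to approximate the two-step workflow from the question is to fit a ColumnTransformer that contains only the MinMaxScaler, and then build the windows batch by batch. The following is a minimal sketch under several assumptions: `iter_window_batches`, `batch_size`, and the small random `data` array are hypothetical names and values for illustration, and NumPy's `sliding_window_view` (NumPy 1.20 or later) stands in for the question's SlidingWindowX, so the window layout differs from that transformer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
data = rng.random((1000, 7))  # small stand-in for data_in (assumption)
window_size = 60
batch_size = 256              # hypothetical batch size

# Step 1: fit the scaler on the full data. fit() still calls fit_transform()
# internally, but without the window step the output is no larger than the input.
scaler_ct = ColumnTransformer([("elec", MinMaxScaler(), list(range(7)))])
scaler_ct.fit(data)


def iter_window_batches(X, fitted_ct, window_size, batch_size):
    """Yield arrays of shape (windows_in_batch, n_features, window_size)."""
    n_windows = X.shape[0] - window_size + 1
    for start in range(0, n_windows, batch_size):
        stop = min(start + batch_size, n_windows)
        # These rows cover every window that begins in [start, stop).
        chunk = fitted_ct.transform(X[start:stop + window_size - 1])
        yield sliding_window_view(chunk, window_size, axis=0)


# Step 2: consume the windows one batch at a time.
batch_shapes = [
    windows.shape
    for windows in iter_window_batches(data, scaler_ct, window_size, batch_size)
]
```

Each yielded array is a read-only view on one scaled chunk, so only a single batch of scaled rows needs to be held at a time. Consecutive chunks overlap by window_size - 1 rows so that no window is lost at a batch boundary.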
