ColumnTransformer implementation in sklearn doesn't have a fit method defined, it just automatically calls fit_transform?
My data grows too large to handle in memory when I apply a sliding window algorithm, but it is small enough that I can call ColumnTransformer's fit method without any issues. Thus, my desired workflow is:
1. Fit MinMaxScaler() on the entire dataset
2. Transform the data in batches for SlidingWindowAlgorithm()
The problem is that a standalone fit method doesn't seem to exist in ColumnTransformer; looking at the source code shows this:
def fit(self, X, y=None):
    # we use fit_transform to make sure to set sparse_output_ (for which we
    # need the transformed data) to have consistent output type in predict
    self.fit_transform(X, y=y)
    return self
I don't understand their reasoning (I don't know the purpose of sparse_output_).
Is there a way that I can fit my data without transforming it? For what it's worth, I'm not using a sparse matrix, just a regular numpy array.
Here is my code. data_in is sized to take about 500 MB of RAM. As you increase the value of window_size, RAM usage spikes dramatically. (I need window_size to be 60.)
## Part 0: Starting
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

data_in = np.random.rand(10*10**6, 7)  # This will take 500MB of RAM
window_size = 1  # Change this value!

## Part 1: Creating Transformers!
class SlidingWindowX(BaseEstimator, TransformerMixin):
    def __init__(self, window_size):
        self.window_size = window_size

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        '''
        Creates a sliding window over an input that has the shape of
        (rows, features) for X
        '''
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        row_size = X.shape[0]
        X_out = np.zeros((row_size-2*self.window_size, 1))
        for j in range(X.shape[1]):
            for i in range(self.window_size):
                idx1 = i
                idx2 = row_size-2*self.window_size+i
                X_out = np.concatenate((X_out, X[idx1:idx2, j].reshape(-1, 1)), axis=1)
        return X_out[:, 1:]

## Part 2: Making pipelines!
attribs_elec = np.arange(0, 7)
pipe_elec = Pipeline([
    ('min-max', MinMaxScaler()),
    ('window', SlidingWindowX(window_size))
])
pipe_full = ColumnTransformer([
    ("elec", pipe_elec, attribs_elec),
])
pipe_full.fit(data_in)
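For a rough sense of the RAM spike described above, the size of the array returned by SlidingWindowX.transform can be estimated from its shape alone: it has (rows - 2 * window_size) rows and (features * window_size) columns. The helper name windowed_nbytes below is made up for this illustration, and the estimate assumes float64 data (8 bytes per value). It is a back-of-the-envelope sketch, not a measurement.

```python
def windowed_nbytes(n_rows, n_features, window_size, itemsize=8):
    """Estimate the size in bytes of SlidingWindowX.transform's output.

    Hypothetical helper: assumes float64 values (itemsize=8) and the output
    shape (n_rows - 2 * window_size, n_features * window_size).
    """
    out_rows = n_rows - 2 * window_size
    out_cols = n_features * window_size
    return out_rows * out_cols * itemsize


n_rows, n_features = 10 * 10**6, 7

# Size of the input array itself, in GB.
print("input:", n_rows * n_features * 8 / 1e9, "GB")

# Size of the windowed output for a few window sizes, in GB.
for w in (1, 10, 60):
    print("window_size =", w, "->", windowed_nbytes(n_rows, n_features, w) / 1e9, "GB")
```

The real peak is likely higher than this estimate, because each np.concatenate call in the loop allocates a new array while the previous one is still alive.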
To answer your question:
"Is there a way that I can fit my data without transforming?"
Yes, you can just call the fit_transform() method with your data, and it will fit the ColumnTransformer. Your data will not be transformed: the input data remains unchanged, and the transformed data is only returned as the output (which you don't have to save).
Since your code cannot be run as you mentioned (lack of data), here is an example from the documentation:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer(
    [("norm1", Normalizer(norm='l1'), [0, 1]),
     ("norm2", Normalizer(norm='l1'), slice(2, 4))])
X = np.array([[0., 1., 2., 2.],
              [1., 1., 0., 1.]])
Now you can call fit_transform():
ct.fit_transform(X)
which returns the transformed data:
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
But the result is not stored anywhere, so the data (X) remains the same:
X
Output:
array([[0., 1., 2., 2.],
       [1., 1., 0., 1.]])
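If the transformed output itself is what doesn't fit in memory, one possible workaround is to take the scaler out of the ColumnTransformer: MinMaxScaler.fit only computes per-column minima and maxima, so it can be fitted on the full array and then applied batch by batch before windowing. The sketch below is only an illustration of the workflow from the question under that assumption. The helpers sliding_window and transform_in_batches are hypothetical names, not part of scikit-learn, and the small random array stands in for the real data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def sliding_window(X, window_size):
    """Same output as SlidingWindowX.transform in the question, built in one step."""
    n_out = X.shape[0] - 2 * window_size
    cols = [X[i:n_out + i, j]
            for j in range(X.shape[1])
            for i in range(window_size)]
    return np.stack(cols, axis=1)


def transform_in_batches(X, scaler, window_size, batch_rows):
    """Yield scaled, windowed blocks of at most `batch_rows` output rows."""
    n_out = X.shape[0] - 2 * window_size
    for start in range(0, n_out, batch_rows):
        stop = min(start + batch_rows, n_out)
        # The window drops 2 * window_size rows, so read that many extra
        # input rows to keep batch boundaries consistent with a full pass.
        chunk = X[start:stop + 2 * window_size]
        yield sliding_window(scaler.transform(chunk), window_size)


rng = np.random.default_rng(0)
data = rng.random((1000, 7))   # small stand-in for data_in
window_size = 3

scaler = MinMaxScaler().fit(data)  # fit only; nothing is transformed here

# Collected into a list only for demonstration; in practice each batch
# would be consumed (written to disk, fed to a model) and then discarded.
batches = list(transform_in_batches(data, scaler, window_size, batch_rows=256))
print(len(batches), batches[0].shape)
```

Because each chunk overlaps the next by 2 * window_size input rows, stacking the batches gives the same rows as windowing the whole scaled array at once, while only one batch needs to be held in memory at a time.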