Is there a way to combine these sklearn Pipelines/ColumnTransformers so I don't have to make multiple fit_transform() calls?
I'd like to create a Pipeline where I can call fit_transform() just once on my training dataset (train_df) and receive a fully preprocessed dataset. I don't think I can currently do that, however, because I have to call PCA() on the output of one ColumnTransformer and then concatenate that output with the result of a separate ColumnTransformer called on train_df. Basically, I think I've gone too far up the abstraction ladder, with one too many Pipelines/ColumnTransformers nested inside each other. As far as I can tell, there's no way to streamline the whole preprocessing step by passing train_df to a single Pipeline or ColumnTransformer, unless I'm missing something. Do you have any insight? I've spent hours racking my brain over this problem and have finally accepted that I'm just spinning my wheels. Any help or solutions would be greatly appreciated.
Thank you!
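import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Numeric branch: per-group preprocessing, then PCA on the combined result
num_ct = ColumnTransformer([
    ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
    ('skewed_num', skewed_num_pipe, skewed_vars)
], remainder='drop')
total_num_pipe = Pipeline([('num_ct', num_ct),
                           ('dim_reduc', PCA(n_components=5))])

# Categorical branch
cat_ct = ColumnTransformer([
    ('cat_pipe1', cat_pipe1, cat_vars1),
    ('cat_pipe2', cat_pipe2, cat_vars2)
], remainder='drop')

# Two separate fit_transform() calls, concatenated by hand
final_num = total_num_pipe.fit_transform(train_df)
final_cat = cat_ct.fit_transform(train_df)
final_X_train = np.c_[final_num, final_cat]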
I finally found a solution to this, thanks to @Alexander's suggestion of chaining ColumnTransformers into a Pipeline. (TL;DR: Don't forget that you can create a Pipeline of ColumnTransformers and use remainder='passthrough' to your advantage.)
I first created a ColumnTransformer that concatenates the transformations for both the numeric and categorical variables, but without the PCA:
ct = ColumnTransformer([
    ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
    ('skewed_num', skewed_num_pipe, skewed_vars),
    ('cat_pipe1', cat_pipe1, cat_vars1),
    ('cat_pipe2', cat_pipe2, cat_vars2)
], remainder='drop')
Then I created a second ColumnTransformer just for the PCA. To specify which columns it applies to, I used a slice object, because this ColumnTransformer will be the second step in the eventual Pipeline and will therefore be fed a NumPy array, not a DataFrame, so columns have to be selected by position rather than by name. The numeric columns come first in that array because the numeric pipelines are listed first in ct. I also set remainder='passthrough' so that the non-numeric variables are carried through unchanged alongside the PCA components.
ct2 = ColumnTransformer([('dim_reduc', PCA(n_components=5), slice(0, 37))], remainder='passthrough') # 37 is number of numeric variables
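To see what the slice plus remainder='passthrough' does in isolation, here is a small sketch on a random NumPy array. The array X, the name pca_ct and the sizes (4 "numeric" columns, 2 components) are made up for illustration; the real pipeline uses 37 columns and 5 components.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Pretend the first 4 columns are preprocessed numeric features and
# the last 2 are already-encoded categorical features.
X = rng.normal(size=(20, 6))

pca_ct = ColumnTransformer(
    [("dim_reduc", PCA(n_components=2), slice(0, 4))],  # positions 0..3
    remainder="passthrough",  # keep columns 4 and 5 as they are
)
X_out = pca_ct.fit_transform(X)

# 2 PCA components first, then the 2 passed-through columns.
print(X_out.shape)
# The trailing columns are the untouched originals.
print(np.allclose(X_out[:, 2:], X[:, 4:]))
```

The passed-through columns are appended after the PCA components, so the final array has the reduced numeric block first and the categorical block last.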
Finally, I created a Pipeline chaining these two ColumnTransformers:
final_pipe = Pipeline([('ct', ct),
                       ('ct2', ct2)])
Calling final_pipe.fit_transform(train_df) yields the cleaned array I wanted. Hope this helps!
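For anyone who wants to try this end to end, below is a self-contained sketch that builds both versions on synthetic data and checks that they give the same result. Everything in it is an assumption made for the demo: demo_df, the column names, the contents of the three sub-pipelines, and the smaller sizes (4 numeric columns, 2 PCA components) are not the ones from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic stand-in for train_df (hypothetical columns).
rng = np.random.default_rng(42)
n = 50
demo_df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.lognormal(size=n),
    "d": rng.lognormal(size=n),
    "kind": np.resize(np.array(["x", "y", "z"]), n),
})
non_skewed_vars, skewed_vars, cat_vars = ["a", "b"], ["c", "d"], ["kind"]

# Hypothetical sub-pipelines; the real ones are not shown in the question.
non_skewed_num_pipe = Pipeline([("scale", StandardScaler())])
skewed_num_pipe = Pipeline([("log", FunctionTransformer(np.log1p)),
                            ("scale", StandardScaler())])
cat_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Chained version: one ColumnTransformer for all preprocessing, a second
# one that applies PCA to the leading numeric block only.
ct = ColumnTransformer([
    ("non_skewed_num", non_skewed_num_pipe, non_skewed_vars),
    ("skewed_num", skewed_num_pipe, skewed_vars),
    ("cat_pipe", cat_pipe, cat_vars),
], remainder="drop", sparse_threshold=0)  # force dense output for the demo

n_numeric = len(non_skewed_vars) + len(skewed_vars)
ct2 = ColumnTransformer(
    [("dim_reduc", PCA(n_components=2), slice(0, n_numeric))],
    remainder="passthrough",
)
final_pipe = Pipeline([("ct", ct), ("ct2", ct2)])
X_chained = final_pipe.fit_transform(demo_df)

# Manual version from the question: two fit_transform() calls, then concat.
num_ct = ColumnTransformer([
    ("non_skewed_num", non_skewed_num_pipe, non_skewed_vars),
    ("skewed_num", skewed_num_pipe, skewed_vars),
], remainder="drop")
total_num_pipe = Pipeline([("num_ct", num_ct),
                           ("dim_reduc", PCA(n_components=2))])
cat_ct = ColumnTransformer([("cat_pipe", cat_pipe, cat_vars)],
                           remainder="drop", sparse_threshold=0)
X_manual = np.c_[total_num_pipe.fit_transform(demo_df),
                 cat_ct.fit_transform(demo_df)]

print(X_chained.shape, X_manual.shape)
print(np.allclose(X_chained, X_manual))
```

In this sketch n_numeric is computed from the column lists, which only works because each numeric sub-pipeline returns one output column per input column. If a numeric step changed the column count, the slice width would need to be worked out some other way.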