Is there a way to combine these sklearn Pipelines/ColumnTransformers so I don't have to make multiple fit_transform() calls?

I'd like to create a Pipeline on which I can call fit_transform() just once on my training dataset (train_df) and receive a fully preprocessed dataset. I don't think I can currently do that, however, because I have to call PCA() on the output of one ColumnTransformer and then concatenate that output with the result of a separate ColumnTransformer called on train_df. Basically, I think I'm going too high up the abstraction ladder, with one too many Pipelines/ColumnTransformers nested inside each other. As far as I can tell, there's no way to streamline the entire preprocessing process by passing train_df to a single Pipeline or ColumnTransformer, unless I'm missing something. Do you have any insight? I've spent hours racking my brain over this problem and have finally faced the reality that I'm just spinning my wheels. Any help or solutions would be greatly appreciated.

Thank you!

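import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# The *_pipe pipelines and *_vars column lists are defined elsewhere (not shown).

# Numeric preprocessing, split by skewness
num_ct = ColumnTransformer([
                        ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
                        ('skewed_num', skewed_num_pipe, skewed_vars)
                        ], remainder='drop')

# PCA is applied to the numeric output only
total_num_pipe = Pipeline([('num_ct', num_ct),
                           ('dim_reduc', PCA(n_components=5))])

# Categorical preprocessing
cat_ct = ColumnTransformer([
                        ('cat_pipe1', cat_pipe1, cat_vars1),
                        ('cat_pipe2', cat_pipe2, cat_vars2)
                        ], remainder='drop')

# Two separate fit_transform() calls, concatenated by hand
final_num = total_num_pipe.fit_transform(train_df)
final_cat = cat_ct.fit_transform(train_df)
final_X_train = np.c_[final_num, final_cat]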

I finally found a solution to this, thanks to @Alexander's suggestion of chaining ColumnTransformers into a Pipeline. (TL;DR: Don't forget that you can build a Pipeline of ColumnTransformers and use remainder='passthrough' to your advantage.)

I first created a ColumnTransformer that concatenates the transformations for both the numeric and the categorical variables, but without the PCA.

ct = ColumnTransformer([
                        ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
                        ('skewed_num', skewed_num_pipe, skewed_vars),
                        ('cat_pipe1', cat_pipe1, cat_vars1),
                        ('cat_pipe2', cat_pipe2, cat_vars2)
                        ], remainder='drop')
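
A ColumnTransformer stacks its outputs side by side in the order the transformers are listed, so in the transformer above the numeric outputs come first in the resulting array and the categorical ones follow. A minimal sketch of that ordering, using a made-up three-column DataFrame and 'passthrough' steps in place of the real pipelines (all names here are illustrative, not from the original post):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical toy data; column names are placeholders.
toy_df = pd.DataFrame({"x": [1.0, 2.0], "y": [10.0, 20.0], "z": [100.0, 200.0]})

# 'passthrough' leaves values untouched, so only the column order changes.
order_ct = ColumnTransformer([
    ("first", "passthrough", ["y"]),
    ("second", "passthrough", ["z", "x"]),
])

ordered = order_ct.fit_transform(toy_df)
print(ordered)  # columns appear as y, z, x: the order of the transformer list
```

The output order depends on the transformer list, not on the column order of the input DataFrame.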

Then I created a ColumnTransformer just for the PCA. To specify which columns it applies to, I used a slice object, because in the eventual Pipeline this ColumnTransformer is the second step and therefore receives a NumPy array, not a DataFrame. I also set remainder='passthrough' so that the non-numeric columns are retained untransformed after the PCA.

ct2 = ColumnTransformer([('dim_reduc', PCA(n_components=5), slice(0, 37))], remainder='passthrough')  # 37 is number of numeric variables
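
To see the slice-plus-passthrough behaviour in isolation, here is a small sketch on a random NumPy array. The array shape, the choice of 4 "numeric" columns, and the name pca_ct are assumptions made for illustration only:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy array: columns 0-3 play the role of numeric features,
# columns 4-5 stand in for already-encoded categorical features.
X = rng.normal(size=(20, 6))

pca_ct = ColumnTransformer(
    [("dim_reduc", PCA(n_components=2), slice(0, 4))],  # positional selection
    remainder="passthrough",  # keep columns 4-5 unchanged
)

out = pca_ct.fit_transform(X)
print(out.shape)  # 2 PCA components + 2 passthrough columns
```

The passthrough columns are appended after the PCA components, so here out[:, 2:] holds the original columns 4 and 5.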

Finally, I created a Pipeline chaining these two ColumnTransformers:

final_pipe = Pipeline([('ct', ct), 
                       ('ct2', ct2)])

Calling final_pipe.fit_transform(train_df) yields the cleaned array I wanted. Hope this helps!
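
For anyone who wants to try the pattern end to end, below is a self-contained sketch on toy data. It is not the original preprocessing: StandardScaler and OneHotEncoder stand in for the undisclosed numeric and categorical pipelines, and toy_df, num_cols and cat_cols are invented names. It also sets sparse_threshold=0 on the first ColumnTransformer, an added assumption so that the second step always receives a dense array. The last lines check that the single call matches the manual two-call approach from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical toy data: three numeric columns and one categorical column.
toy_df = pd.DataFrame({
    "a": rng.normal(size=30),
    "b": rng.normal(size=30),
    "c": rng.normal(size=30),
    "color": np.tile(["red", "green", "blue"], 10),
})
num_cols = ["a", "b", "c"]
cat_cols = ["color"]

# Step 1: all per-column preprocessing, numeric outputs listed first.
ct = ColumnTransformer(
    [("num", StandardScaler(), num_cols),
     ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="drop",
    sparse_threshold=0,  # assumption: force dense output for the next step
)

# Step 2: PCA on the leading numeric block, everything else passed through.
ct2 = ColumnTransformer(
    [("dim_reduc", PCA(n_components=2), slice(0, len(num_cols)))],
    remainder="passthrough",
)

final_pipe = Pipeline([("ct", ct), ("ct2", ct2)])
X_one_call = final_pipe.fit_transform(toy_df)

# Manual two-call equivalent, for comparison only.
num_part = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(toy_df[num_cols])
)
cat_part = OneHotEncoder(handle_unknown="ignore").fit_transform(toy_df[cat_cols]).toarray()
X_two_calls = np.c_[num_part, cat_part]

print(X_one_call.shape)
print(np.allclose(X_one_call, X_two_calls))
```

Deriving the slice bound from len(num_cols) works here only because each numeric column yields exactly one output column; if a numeric step changes the column count, the bound has to be the number of numeric output columns instead, as with the hard-coded 37 above.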

