Is there a way to combine these sklearn Pipelines/ColumnTransformers so I don't have to make multiple fit_transform() calls?

Question

I'd like to create a Pipeline where I can call fit_transform() just once on my training dataset (train_df) and receive a fully preprocessed dataset. I don't think I can currently do that, however, because I have to call PCA() on the output of one ColumnTransformer and then concatenate that output with the result of a separate ColumnTransformer called on train_df. Basically, I think I'm going too high up the abstraction ladder, with one too many Pipelines/ColumnTransformers embedded within each other. As far as I can tell, there's no way to streamline the entire preprocessing process by passing train_df to a single Pipeline or ColumnTransformer, unless I'm missing something. Do you have any insight? I've spent hours racking my brain over this problem and have finally faced the reality that I'm just spinning my wheels. Any help or solutions would be greatly appreciated.

Thank you!

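import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# non_skewed_num_pipe, skewed_num_pipe, cat_pipe1, cat_pipe2 and the *_vars
# column lists are defined elsewhere (not shown).
num_ct = ColumnTransformer([
    ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
    ('skewed_num', skewed_num_pipe, skewed_vars)
], remainder='drop')

# Numeric branch: preprocess, then reduce to 5 principal components.
total_num_pipe = Pipeline([('num_ct', num_ct),
                           ('dim_reduc', PCA(n_components=5))])

cat_ct = ColumnTransformer([
    ('cat_pipe1', cat_pipe1, cat_vars1),
    ('cat_pipe2', cat_pipe2, cat_vars2)
], remainder='drop')

# Two separate fit_transform() calls, concatenated column-wise by hand.
final_num = total_num_pipe.fit_transform(train_df)
final_cat = cat_ct.fit_transform(train_df)
final_X_train = np.c_[final_num, final_cat]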

Answer 1

I finally found a solution to this, thanks to @Alexander's suggestion of chaining ColumnTransformers into a Pipeline. (TLDR: Don't forget that you can create a Pipeline of ColumnTransformers, using remainder='passthrough' to your advantage.)

I first created a ColumnTransformer that concatenates the transformations for both the numeric and the categorical variables, but without the PCA.

ct = ColumnTransformer([
    ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
    ('skewed_num', skewed_num_pipe, skewed_vars),
    ('cat_pipe1', cat_pipe1, cat_vars1),
    ('cat_pipe2', cat_pipe2, cat_vars2)
], remainder='drop')
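A side note on column order, as a minimal sketch: a ColumnTransformer lays out its output columns in the order the transformers are listed, not in the order the columns appear in the DataFrame. Listing the numeric pipelines first therefore puts the numeric outputs in the leading columns. The toy frame, `toy_ct`, and the choice of StandardScaler and OrdinalEncoder below are illustrative assumptions, not the pipelines from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy frame: the categorical column comes FIRST in the DataFrame.
toy_df = pd.DataFrame({
    "cat": ["a", "b", "a", "c"],
    "num": [1.0, 2.0, 3.0, 4.0],
})

# The numeric transformer is listed first, so its output lands in column 0.
toy_ct = ColumnTransformer([
    ("num", StandardScaler(), ["num"]),
    ("cat", OrdinalEncoder(), ["cat"]),
], remainder="drop")

out = toy_ct.fit_transform(toy_df)

# Column 0 of the output is the scaled numeric column, despite the frame's order.
expected_num = StandardScaler().fit_transform(toy_df[["num"]])
print(np.allclose(out[:, [0]], expected_num))
```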

Then, I created a ColumnTransformer just for the PCA. To specify which columns to apply it to, I used a slice object, since this ColumnTransformer will be fed a NumPy array (not a DataFrame) in the eventual Pipeline, where it is the second ColumnTransformer. The slice covers the leading columns, which are the numeric ones because the numeric pipelines are listed first in ct. I also set remainder='passthrough' so that the remaining (categorical) columns pass through the PCA step unchanged and are appended after the principal components.

ct2 = ColumnTransformer([('dim_reduc', PCA(n_components=5), slice(0, 37))], remainder='passthrough')  # 37 is number of numeric variables
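To see what slice selection plus remainder='passthrough' does on a plain array, here is a small sketch. The array `X`, its shape, the name `demo_ct2`, and the use of 2 components over 4 columns are made up for illustration. In the real pipeline the slice bound would be the number of numeric columns (37 above).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

# Made-up array standing in for the first ColumnTransformer's output:
# 4 "numeric" columns followed by 2 already-encoded "categorical" columns.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 4)), rng.integers(0, 3, size=(20, 2))])

# PCA is applied only to columns 0..3; the rest are passed through as-is.
demo_ct2 = ColumnTransformer(
    [("dim_reduc", PCA(n_components=2), slice(0, 4))],
    remainder="passthrough",
)
out = demo_ct2.fit_transform(X)

print(out.shape)                          # PCA components first, then passthrough columns
print(np.allclose(out[:, 2:], X[:, 4:]))  # passthrough columns keep their values
```

Because the slice bound is positional, it has to be updated by hand if the number of numeric output columns ever changes.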

Finally, I created a Pipeline chaining these two ColumnTransformers:

final_pipe = Pipeline([('ct', ct), 
                       ('ct2', ct2)])

Calling final_pipe.fit_transform(train_df) yields the cleaned array I wanted. Hope this helps!
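As a rough end-to-end check, the sketch below builds both versions on a toy dataset and compares the results. Everything about the data is assumed: the column names, the toy `train_df`, the sub-pipelines (StandardScaler, PowerTransformer, OrdinalEncoder), and `n_components=2` are hypothetical stand-ins, since the original sub-pipelines are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, PowerTransformer, StandardScaler

# Toy stand-in for train_df; all column names and values are made up.
rng = np.random.default_rng(0)
n = 50
train_df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.lognormal(size=n),
    "d": rng.lognormal(size=n),
    "color": rng.choice(["red", "green", "blue"], size=n),
    "size": rng.choice(["S", "M", "L"], size=n),
})
non_skewed_vars, skewed_vars = ["a", "b"], ["c", "d"]
cat_vars1, cat_vars2 = ["color"], ["size"]
n_numeric = len(non_skewed_vars) + len(skewed_vars)  # each pipe keeps its column count

# Hypothetical sub-pipelines (the originals are not shown in the post).
non_skewed_num_pipe = Pipeline([("scale", StandardScaler())])
skewed_num_pipe = Pipeline([("power", PowerTransformer())])
cat_pipe1 = Pipeline([("enc", OrdinalEncoder())])
cat_pipe2 = Pipeline([("enc", OrdinalEncoder())])

# One-call version: ColumnTransformer -> ColumnTransformer, chained in a Pipeline.
ct = ColumnTransformer([
    ("non_skewed_num", non_skewed_num_pipe, non_skewed_vars),
    ("skewed_num", skewed_num_pipe, skewed_vars),
    ("cat_pipe1", cat_pipe1, cat_vars1),
    ("cat_pipe2", cat_pipe2, cat_vars2),
], remainder="drop")
ct2 = ColumnTransformer(
    [("dim_reduc", PCA(n_components=2), slice(0, n_numeric))],
    remainder="passthrough",
)
final_pipe = Pipeline([("ct", ct), ("ct2", ct2)])
one_call = final_pipe.fit_transform(train_df)

# Two-call version from the question, for comparison.
num_ct = ColumnTransformer([
    ("non_skewed_num", non_skewed_num_pipe, non_skewed_vars),
    ("skewed_num", skewed_num_pipe, skewed_vars),
], remainder="drop")
total_num_pipe = Pipeline([("num_ct", num_ct), ("dim_reduc", PCA(n_components=2))])
cat_ct = ColumnTransformer([
    ("cat_pipe1", cat_pipe1, cat_vars1),
    ("cat_pipe2", cat_pipe2, cat_vars2),
], remainder="drop")
two_call = np.c_[total_num_pipe.fit_transform(train_df),
                 cat_ct.fit_transform(train_df)]

print(one_call.shape)
print(np.allclose(one_call, two_call))
```

One caveat under these assumptions: the toy encoders return dense arrays. If the categorical pipelines produce sparse output (for example, a one-hot encoder with default settings), the first ColumnTransformer may return a sparse matrix, and passing sparse_threshold=0 to it forces a dense result.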
