Applying two different Sklearn classifiers to two different subsets of the same data

Question

I have a dataset that I need to run through a classification Pipeline. The dataset has 2 types of rows:

described: description column POPULATED
non-described: description column EMPTY

I want to apply one classifier targeting ONLY the described data, and another one for the non-described data.

I am currently doing so by splitting the dataset, then preprocessing each part and feeding it to its corresponding classifier separately. What I want to accomplish is fitting this process into a single Sklearn pipeline. It should be something like this:

classifierPipe = Pipeline([('preproc_described', DescPreprocessor),
                           ('preproc_non_described', NonDescPreprocessor),
                           ('clf_described', CLF1),
                           ('clf_non_described', CLF2)
                          ])

classifierPipe.fit(X_train,y_train)

I was reviewing StackingClassifier, but according to the documentation, the initial estimators are applied to all the rows in the dataset.

How can I create such a pipeline where each classifier targets a specific subset of the whole dataset?

Answer 1

Why not just create two different datasets and use one classifier on each? Simple code like the following should be sufficient:

import pandas as pd

df = pd.read_csv('csv_name.csv')

# drop each column in the respective dataset
for_clf_1 = df.drop(['described'], axis=1)
for_clf_2 = df.drop(['not described'], axis=1)
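If the split has to happen by row (description populated vs. empty) and still behave like one estimator, one option is a small custom meta-estimator that routes each row to one of two sub-estimators. The sketch below is only an illustration under assumptions: `RowRoutingClassifier` is a hypothetical helper (not part of scikit-learn), the column names `description` and `amount` and the toy data are made up, and each sub-estimator is itself a pipeline that carries its own preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class RowRoutingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical helper: sends rows with a populated `column` to one
    classifier and rows with an empty `column` to another."""

    def __init__(self, clf_described, clf_non_described, column="description"):
        self.clf_described = clf_described
        self.clf_non_described = clf_non_described
        self.column = column

    def _described_mask(self, X):
        # True where the column holds non-blank text; None/NaN/"" count as empty.
        text = X[self.column].fillna("").astype(str).str.strip()
        return text.ne("").to_numpy(dtype=bool)

    def fit(self, X, y):
        y = np.asarray(y)
        mask = self._described_mask(X)
        # Each sub-estimator is cloned and fitted on its own subset of rows.
        self.clf_described_ = clone(self.clf_described).fit(X[mask], y[mask])
        self.clf_non_described_ = clone(self.clf_non_described).fit(
            X[~mask].drop(columns=[self.column]), y[~mask]
        )
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        mask = self._described_mask(X)
        pred = np.empty(len(X), dtype=self.classes_.dtype)
        if mask.any():
            pred[mask] = self.clf_described_.predict(X[mask])
        if (~mask).any():
            pred[~mask] = self.clf_non_described_.predict(
                X[~mask].drop(columns=[self.column])
            )
        return pred


# Assumed toy data: a text column that is sometimes empty plus one numeric column.
X = pd.DataFrame({
    "description": ["red apple", None, "green pear", "",
                    "red cherry", None, "green lime", None],
    "amount": [1.0, 10.0, 2.0, 20.0, 1.5, 11.0, 2.5, 21.0],
})
y = np.array(["a", "a", "b", "b", "a", "a", "b", "b"])

# Described rows: TF-IDF on the text, numeric column passed through.
described_pipe = make_pipeline(
    ColumnTransformer([("text", TfidfVectorizer(), "description")],
                      remainder="passthrough"),
    LogisticRegression(),
)
# Non-described rows: only the numeric column is available.
non_described_pipe = make_pipeline(LogisticRegression())

model = RowRoutingClassifier(described_pipe, non_described_pipe).fit(X, y)
preds = model.predict(X)
```

Because the wrapper follows the usual `fit`/`predict` convention, it could also serve as the final step of an outer `Pipeline`, provided the earlier steps keep the input as a DataFrame that still contains the routing column.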

Answer 1
0 2020-09-03 13:19:21
