
Apply two different Sklearn classifiers to two different subsets of the same data

I have a dataset that I need to run through a classification Pipeline. The dataset has 2 types of rows:

  • described: description column POPULATED
  • non-described: description column EMPTY

I want to apply one classifier targeting ONLY the described data, and another one for the non-described data.

I currently do this by splitting the dataset, then preprocessing each part and feeding it to its corresponding classifier separately. What I want to accomplish is fitting this process into a single Sklearn pipeline. It should be something like this:

classifierPipe = Pipeline([('preproc_described', DescPreprocessor),
                           ('preproc_non_described', NonDescPreprocessor),
                           ('clf_described', CLF1),
                           ('clf_non_described', CLF2)
                          ])

classifierPipe.fit(X_train,y_train)

I looked into StackingClassifier, but according to the documentation, its initial estimators are applied to all the rows in the dataset.

How can I create such a pipeline where each classifier targets a specific subset of the whole dataset?

Why not just create two different datasets and use one classifier on each? Simple code like the following should be sufficient:

import pandas as pd

df = pd.read_csv('csv_name.csv')

# drop the other column from each respective dataset
for_clf_1 = df.drop(['described'], axis=1)
for_clf_2 = df.drop(['not described'], axis=1)
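The answer above separates the data by dropping columns, whereas the question describes two kinds of rows. One possible way to keep a single fit/predict interface is a small custom estimator that routes each row to one of two sub-pipelines depending on whether its description is empty. The following is only a sketch under stated assumptions: the class name SplitByDescriptionClassifier, the price column, the toy data, and the choice of TfidfVectorizer, LogisticRegression and DecisionTreeClassifier are all hypothetical and stand in for the asker's DescPreprocessor, NonDescPreprocessor, CLF1 and CLF2. It is not a built-in Sklearn feature.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class SplitByDescriptionClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: sends each row to one of two estimators."""

    def __init__(self, described_estimator, non_described_estimator,
                 column="description"):
        self.described_estimator = described_estimator
        self.non_described_estimator = non_described_estimator
        self.column = column

    def _described_mask(self, X):
        # True where the description holds a non-empty string
        text = X[self.column].fillna("").astype(str).str.strip()
        return (text != "").to_numpy()

    def fit(self, X, y):
        y = np.asarray(y)
        mask = self._described_mask(X)
        self.classes_ = np.unique(y)
        # Each estimator only ever sees its own subset of rows
        self.described_estimator_ = clone(self.described_estimator).fit(
            X.loc[mask], y[mask])
        self.non_described_estimator_ = clone(self.non_described_estimator).fit(
            X.loc[~mask], y[~mask])
        return self

    def predict(self, X):
        mask = self._described_mask(X)
        out = np.empty(len(X), dtype=self.classes_.dtype)
        if mask.any():
            out[mask] = self.described_estimator_.predict(X.loc[mask])
        if (~mask).any():
            out[~mask] = self.non_described_estimator_.predict(X.loc[~mask])
        return out


# Sub-pipeline for described rows: text features plus an assumed numeric column
described_pipe = Pipeline([
    ("preproc", ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "description"),
        ("num", "passthrough", ["price"]),
    ])),
    ("clf", LogisticRegression()),
])

# Sub-pipeline for non-described rows: numeric column only
non_described_pipe = Pipeline([
    ("preproc", ColumnTransformer([("num", "passthrough", ["price"])])),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Toy data, invented purely for illustration
df = pd.DataFrame({
    "description": ["red apple", "green apple", "yellow banana", "ripe banana",
                    None, "", None, ""],
    "price": [1.0, 1.2, 0.5, 0.6, 1.1, 0.4, 1.3, 0.55],
})
y = np.array(["apple", "apple", "banana", "banana",
              "apple", "banana", "apple", "banana"])

router = SplitByDescriptionClassifier(described_pipe, non_described_pipe)
router.fit(df, y)
preds = router.predict(df)
```

Because the wrapper exposes fit and predict, it can in principle be used as the final step of a larger Pipeline or passed to utilities such as cross_val_score. Note that fit assumes both subsets are non-empty and contain at least two classes each; real data may need extra checks for that.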
