How to correctly use FeatureUnion with numeric and text features in Python sklearn

Question

I'm trying to use FeatureUnion for the first time in an sklearn pipeline to combine numerical features (2 columns) and text features (1 column) for multi-class classification.

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1','num2']], validate=False)

process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ]))
         ]
    )

In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.

The error message is:

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
 steps=[('selector', FunctionTransformer(accept_sparse=False,
      func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
      inverse_func=None, kw_args=None, pass_y='deprecated',
      validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
      verbose=0, warm_start=False),
      n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

Did I miss a step?

Answer 1

A FeatureUnion should be used as a step inside the pipeline, not wrapped around it. You get this error because a classifier appears inside the union instead of as the final step of the overall pipeline: the union calls fit and transform on every transformer it contains, and a classifier does not have a transform method.
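The missing method can be checked directly. The following is a minimal sketch, assuming a reasonably recent scikit-learn version; the variable names are only illustrative. It shows that the classifier, and any pipeline that ends in it, exposes no transform, while a pipeline that ends in a vectorizer does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

clf = OneVsRestClassifier(LogisticRegression())

# A sub-pipeline that ends in a classifier, as in the question.
pipe_with_clf = Pipeline([("vec", CountVectorizer()), ("clf", clf)])

# A sub-pipeline that ends in a transformer, which FeatureUnion can accept.
pipe_transform_only = Pipeline([("vec", CountVectorizer())])

# FeatureUnion needs every member to offer transform.
checks = {
    "classifier": hasattr(clf, "transform"),
    "pipeline ending in classifier": hasattr(pipe_with_clf, "transform"),
    "pipeline ending in vectorizer": hasattr(pipe_transform_only, "transform"),
}

for name, has_transform in checks.items():
    print(f"{name}: has transform = {has_transform}")
```

Only the last entry should report a transform method, which is why the classifier has to move out of the union.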

Simply rework it into an outer pipeline with the classifier as the final step:

process_and_join_features = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data)
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer())
            ]))
         ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

Also see the scikit-learn website for a good example of this sort of thing.
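To try the reworked pipeline end to end, here is a self-contained sketch. The toy DataFrame, the labels, and the helper functions select_text and select_numeric are made up for illustration; only the column names 'text', 'num1' and 'num2' come from the question. Named functions stand in for the lambdas, which is an optional choice.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "blue sky",
             "dark sky", "fast car", "slow car"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.4],
    "num2": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
})
y = ["fruit", "fruit", "sky", "sky", "car", "car"]


def select_text(frame):
    # Return the single text column as a 1-D sequence of strings.
    return frame["text"]


def select_numeric(frame):
    # Return the two numeric columns as a 2-D table.
    return frame[["num1", "num2"]]


model = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([
            ("selector", FunctionTransformer(select_numeric, validate=False)),
        ])),
        ("text_features", Pipeline([
            ("selector", FunctionTransformer(select_text, validate=False)),
            ("vec", CountVectorizer()),
        ])),
    ])),
    # The classifier is the final step of the outer pipeline.
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

model.fit(df, y)

# Joined feature matrix: 2 numeric columns plus one column per vocabulary word.
X_joined = model.named_steps["features"].transform(df)
predictions = model.predict(df)
print(X_joined.shape)
print(list(predictions))
```

With this toy data the joined matrix has 2 numeric columns plus 9 word-count columns, one per distinct word in the text column.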

Answer 2

I believe @Ken Syme correctly identified the problem and provided a fix for what you intend to do. However, in case you actually intend to use the output of the classifier as a feature for a higher-level model, check out this blog.

Using the ModelTransformer by Zac, you can set up your pipeline as follows:

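from pandas import DataFrame
from sklearn.base import TransformerMixin


class ModelTransformer(TransformerMixin):
    # Wraps a model so that its predictions can be used as a feature.

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # Return the predictions as a single-column DataFrame.
        return DataFrame(self.model.predict(X))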


process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ]))
         ]
)

Depending on your concrete next steps, you may still have to wrap the FeatureUnion in a Pipeline (e.g. using the shortcut make_pipeline).
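As a rough illustration of that last point, the sketch below puts a final LogisticRegression on top of the union with make_pipeline. It is not the blog's exact code: this variant of ModelTransformer also inherits from BaseEstimator and stores a fitted clone in model_, both assumptions made so that it behaves like a regular estimator in recent scikit-learn versions. The toy data, the integer labels, and the selector helpers are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer


class ModelTransformer(TransformerMixin, BaseEstimator):
    """Expose a model's predictions as a one-column feature (sketch)."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        # Fit a clone so the estimator passed to __init__ stays untouched.
        self.model_ = clone(self.model).fit(X, y)
        return self

    def transform(self, X):
        return pd.DataFrame(self.model_.predict(X))


def select_text(frame):
    return frame["text"]


def select_numeric(frame):
    return frame[["num1", "num2"]]


# Hypothetical toy data; integer labels keep the stacked features numeric.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "blue sky",
             "dark sky", "fast car", "slow car"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.4],
    "num2": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
})
y = [0, 0, 1, 1, 2, 2]

union = FeatureUnion([
    ("numeric_features", Pipeline([
        ("selector", FunctionTransformer(select_numeric, validate=False)),
        ("clf", ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
    ("text_features", Pipeline([
        ("selector", FunctionTransformer(select_text, validate=False)),
        ("vec", CountVectorizer()),
        ("clf", ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
])

# The higher-level model consumes one prediction column per sub-pipeline.
stacked_model = make_pipeline(union, LogisticRegression())
stacked_model.fit(df, y)

stacked_features = stacked_model.named_steps["featureunion"].transform(df)
predictions = stacked_model.predict(df)
print(stacked_features.shape)
```

Note that this sketch trains the higher-level model on predictions made for the same rows the sub-models were fitted on, so in practice out-of-fold predictions are probably a safer input.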

Answer 1: 9 (accepted), 2017-12-11 09:49:48

Answer 2: 6, 2017-12-11 10:43:00
