
How to FeatureUnion numerical and text features in Python sklearn properly

I'm trying to use FeatureUnion for the first time in an sklearn pipeline to combine numerical features (2 columns) and text features (1 column) for multi-class classification.

from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion

get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1','num2']], validate=False)

process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ]))
         ]
    )

In this code, 'text' is the text column and 'num1', 'num2' are the two numeric columns.

The error message is:

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
 steps=[('selector', FunctionTransformer(accept_sparse=False,
      func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
      inverse_func=None, kw_args=None, pass_y='deprecated',
      validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
      verbose=0, warm_start=False),
      n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

Did I miss a step?

A FeatureUnion should be used as a step in the pipeline, not wrapped around the pipeline. The error occurs because each inner pipeline ends in a classifier: the union calls fit and transform on every transformer it holds, and a pipeline whose final step is a classifier does not have a transform method.
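The point about the missing transform method can be checked directly. The following is a minimal sketch; the step names and the use of StandardScaler are illustrative assumptions, not part of the question's code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline whose last step is a classifier exposes predict, not transform.
ends_in_classifier = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# A pipeline made only of transformers does expose transform.
only_transformers = Pipeline([
    ("scale", StandardScaler()),
])

has_transform_clf = hasattr(ends_in_classifier, "transform")
has_transform_tf = hasattr(only_transformers, "transform")
print(has_transform_clf, has_transform_tf)
```

Only the second kind of pipeline can be placed inside a FeatureUnion.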

Simply rework the code so that an outer pipeline has the classifier as its final step:

process_and_join_features = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data)
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer())
            ]))
         ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
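As a rough end-to-end check, the reworked pipeline can be fitted on a small frame. This is a sketch under assumptions: the toy data and labels are made up, named functions replace the lambdas (so the pipeline can be pickled), the numeric selector returns a NumPy array as a defensive choice, and max_iter is raised only to avoid convergence warnings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data; the column names follow the question.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
})
y = ["fruit", "fruit", "vehicle", "vehicle", "weather", "weather"]


def select_text(frame):
    # One column of raw strings for CountVectorizer.
    return frame["text"]


def select_numeric(frame):
    # Two numeric columns as a plain 2-D array.
    return frame[["num1", "num2"]].to_numpy()


model = Pipeline([
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([
            ("selector", FunctionTransformer(select_numeric, validate=False)),
        ])),
        ("text_features", Pipeline([
            ("selector", FunctionTransformer(select_text, validate=False)),
            ("vec", CountVectorizer()),
        ])),
    ])),
    # The classifier is the final step of the outer pipeline.
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

model.fit(df, y)
preds = model.predict(df)
print(preds)
```

The union stacks the two numeric columns next to the bag-of-words columns, and the single classifier at the end sees all of them at once.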

Also see the scikit-learn website for a good example of this sort of thing.

I believe @Ken Syme correctly identified the problem and provided a fix for what you intend to do. However, just in case you actually intend to use the output of the classifier as a feature for a higher-level model, check out this blog.

Using the ModelTransformer by Zac, you can build your pipeline as follows:

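from pandas import DataFrame
from sklearn.base import BaseEstimator, TransformerMixin


# BaseEstimator is added here so that get_params and clone work on the wrapper.
class ModelTransformer(TransformerMixin, BaseEstimator):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        # Fit the wrapped model; Pipeline passes X and y through.
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # Expose the model's predictions as a one-column feature block.
        return DataFrame(self.model.predict(X))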


process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ]))
         ]
)

Depending on your concrete next steps, you may still have to wrap the FeatureUnion in a Pipeline (e.g. using the shortcut make_pipeline).
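One way the wrapping could look is sketched below, with a final LogisticRegression acting as the higher-level model. Everything beyond ModelTransformer and the column names is an assumption for illustration: the toy data, the integer labels (so the predictions can be consumed as numeric features), the named selector functions, and the choice of final model.

```python
import pandas as pd
from pandas import DataFrame
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer


class ModelTransformer(TransformerMixin, BaseEstimator):
    """Expose a model's predictions as a feature column."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None, **fit_params):
        self.model.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        return DataFrame(self.model.predict(X))


def select_text(frame):
    return frame["text"]


def select_numeric(frame):
    return frame[["num1", "num2"]].to_numpy()


# Hypothetical toy data with integer class labels.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.5, 0.4, 0.9, 0.8],
})
y = [0, 0, 1, 1, 2, 2]


def base_model():
    return ModelTransformer(OneVsRestClassifier(LogisticRegression(max_iter=1000)))


union = FeatureUnion([
    ("numeric_features", Pipeline([
        ("selector", FunctionTransformer(select_numeric, validate=False)),
        ("clf", base_model()),
    ])),
    ("text_features", Pipeline([
        ("selector", FunctionTransformer(select_text, validate=False)),
        ("vec", CountVectorizer()),
        ("clf", base_model()),
    ])),
])

# The union output (one prediction column per branch) feeds the final model.
stacked = make_pipeline(union, LogisticRegression(max_iter=1000))
stacked.fit(df, y)

meta_features = stacked.named_steps["featureunion"].transform(df)
stacked_preds = stacked.predict(df)
print(meta_features.shape, stacked_preds)
```

Note that this sketch fits the base models and the final model on the same rows, which tends to overfit; a real stacking setup would usually generate the base predictions out of fold.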

