How to properly use FeatureUnion to combine numerical and text features in Python scikit-learn
I'm trying to use FeatureUnion for the first time in a scikit-learn pipeline to combine numerical features (2 columns) and text features (1 column) for multi-class classification.
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer  # import was missing

# Column selectors: one for the text column, one for the two numeric columns
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1', 'num2']], validate=False)

# Each branch of the union ends in a classifier (this is what raises the error)
process_and_join_features = FeatureUnion(
    [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer()),
            ('clf', OneVsRestClassifier(LogisticRegression()))
        ]))
    ]
)
In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.
The error message is:
TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
steps=[('selector', FunctionTransformer(accept_sparse=False,
func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
inverse_func=None, kw_args=None, pass_y='deprecated',
validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't
Did I miss a step?
A FeatureUnion should be used as a step in the pipeline, not around the pipeline. You are getting the error because each of your inner pipelines ends in a classifier: the union tries to call fit and transform on all of its transformers, and a classifier does not have a transform method.
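The point about the missing transform method can be checked directly. The following is a minimal sketch, assuming a reasonably recent scikit-learn in which a Pipeline only exposes transform when its final step does; the variable names are illustrative and not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A pipeline whose last step is a classifier (like the branches in the question)
ends_in_classifier = Pipeline([
    ('vec', CountVectorizer()),
    ('clf', LogisticRegression()),
])

# A pipeline made of transformers only
transformers_only = Pipeline([
    ('vec', CountVectorizer()),
])

# FeatureUnion needs every branch to offer fit and transform
classifier_has_transform = hasattr(LogisticRegression(), 'transform')
clf_pipeline_has_transform = hasattr(ends_in_classifier, 'transform')
plain_pipeline_has_transform = hasattr(transformers_only, 'transform')

print(classifier_has_transform, clf_pipeline_has_transform, plain_pipeline_has_transform)
```

Only the transformer-only pipeline is usable as a branch of a FeatureUnion, which is why the classifier has to move out of the branches.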
Simply rework it so that an outer pipeline has the classifier as its final step:
process_and_join_features = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data)
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer())
        ]))
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])
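To see the reworked pipeline run end to end, here is a small self-contained sketch. The toy DataFrame, the labels, and the helper names select_text and select_numeric are hypothetical; only the column names 'text', 'num1' and 'num2' come from the question. Converting the numeric columns to a NumPy array is an assumption of this sketch to keep the stacked inputs uniform, not something the answer requires.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data with three classes
df = pd.DataFrame({
    'text': ['red apple pie', 'green apple tart', 'fast blue car',
             'slow blue truck', 'tall oak tree', 'old pine tree'],
    'num1': [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    'num2': [0.1, 0.2, 0.8, 0.9, 0.4, 0.5],
})
y = [0, 0, 1, 1, 2, 2]


def select_text(frame):
    # Return the text column as a 1-D sequence of strings
    return frame['text']


def select_numeric(frame):
    # Return the two numeric columns as a 2-D array
    return frame[['num1', 'num2']].to_numpy()


model = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', FunctionTransformer(select_numeric, validate=False)),
        ])),
        ('text_features', Pipeline([
            ('selector', FunctionTransformer(select_text, validate=False)),
            ('vec', CountVectorizer()),
        ])),
    ])),
    # The classifier is the final step of the outer pipeline
    ('clf', OneVsRestClassifier(LogisticRegression())),
])

model.fit(df, y)

# The union output has the 2 numeric columns plus one column per vocabulary word
features = model.named_steps['features'].transform(df)
predictions = model.predict(df)
print(features.shape)
print(predictions)
```

Named functions are used here instead of lambdas mainly so that the fitted pipeline can be pickled; the lambdas from the question behave the same way during fitting.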
The scikit-learn website also has a good example of this sort of thing.
I believe @Ken Syme correctly identified the problem and provided a fix for what you intend to do. However, in case you actually intend to use the output of the classifier as a feature for a higher-level model, check out this blog.
Using the ModelTransformer by Zac, you can set up your pipeline as follows:
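from pandas import DataFrame
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    """Wrap a model so that its predictions are exposed through transform."""

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # The wrapped model's predictions become the output feature column
        return DataFrame(self.model.predict(X))
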
process_and_join_features = FeatureUnion(
    [
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vec', CountVectorizer()),
            ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
        ]))
    ]
)
Depending on your concrete next steps, you may still have to wrap the FeatureUnion in a Pipeline (e.g. using the shortcut make_pipeline).
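As a rough illustration of that last step, the sketch below wraps such a union with make_pipeline and puts a higher-level model on top of the two branch predictions. The toy data, the helper names, and the choice of LogisticRegression as the final model are assumptions made for the example. This version of ModelTransformer also inherits from BaseEstimator, which is a deviation from the class shown above, made so that it works with scikit-learn's cloning and parameter utilities.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer


class ModelTransformer(TransformerMixin, BaseEstimator):
    """Expose a model's predictions as a single feature column."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        return pd.DataFrame(self.model.predict(X))


def select_text(frame):
    return frame['text']


def select_numeric(frame):
    return frame[['num1', 'num2']]


# Hypothetical toy data; integer labels so predictions can be reused as features
df = pd.DataFrame({
    'text': ['red apple pie', 'green apple tart', 'fast blue car',
             'slow blue truck', 'tall oak tree', 'old pine tree'],
    'num1': [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    'num2': [0.1, 0.2, 0.8, 0.9, 0.4, 0.5],
})
y = [0, 0, 1, 1, 2, 2]

union = FeatureUnion([
    ('numeric_features', Pipeline([
        ('selector', FunctionTransformer(select_numeric, validate=False)),
        ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
    ('text_features', Pipeline([
        ('selector', FunctionTransformer(select_text, validate=False)),
        ('vec', CountVectorizer()),
        ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression()))),
    ])),
])

# The final model sees one predicted label per branch as its input features
stacked = make_pipeline(union, LogisticRegression())
stacked.fit(df, y)

meta_features = stacked.steps[0][1].transform(df)
final_predictions = stacked.predict(df)
print(meta_features.shape)
print(final_predictions)
```

Note that this sketch fits the branch models and the final model on the same rows, which is fine for showing the wiring but would overstate performance in practice; out-of-fold predictions are the usual remedy.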