Sklearn FeatureUnion raises TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))

Question

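I am trying to union two pipelines:

pipeline_1 returns a sparse matrix of float64
pipeline_2 returns the original column (str) as a pandas DataFrame (returning a Series instead leads to ValueError: blocks[0,:] has incompatible row dimensions.)

When doing so, I get the error:

TypeError: no supported conversion for types: (dtype('int64'), dtype('O'))

My goal is to find a general way to keep the original columns of a DataFrame in the pipeline so that a classifier can use them later.

Code: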

import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion


class ColumnSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key, transform_function=None):
        self.key = key
        self.transform_function = transform_function

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        result = X[self.key]
        if self.transform_function:
            result = self.transform_function(result)
        return result


data = [
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'}
]
df = pd.DataFrame(data)



pipeline_1 = Pipeline([
    ('selector', ColumnSelector(key='col1')),
    ('vectorizer', CountVectorizer())
])

pipeline_2 = Pipeline([
    ('test', ColumnSelector(key='col2'))#, transform_function=lambda col: col.to_frame())),
])

feats = FeatureUnion([('count_vectorize', pipeline_1), ('original_column', pipeline_2)])

feats.fit_transform(df)

Answer 1

FeatureUnion uses numpy or scipy sparse operations to join the output of each of its features. Therefore, you cannot have any step in a FeatureUnion that returns non-numeric values.
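
To make the stacking step concrete, here is a minimal sketch that calls scipy.sparse.hstack directly on hand-made arrays. It is an illustration of the kind of operation involved, not FeatureUnion's actual source, and the array names (counts, lengths, strings) are hypothetical. The exact exception type and message for the string column may differ between scipy versions, so the sketch catches both TypeError and ValueError.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output: a sparse int64 matrix.
counts = sparse.csr_matrix(
    np.array([[1, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=np.int64)
)

# A numeric column (string lengths) stacks next to the sparse block without trouble.
lengths = np.array([[11], [12], [13]], dtype=np.int64)
stacked = sparse.hstack([counts, lengths]).tocsr()
print(stacked.shape)

# A column of raw strings has dtype object, which sparse matrices cannot hold.
strings = np.array(
    [['somestring_'], ['somestring__'], ['somestring___']], dtype=object
)
try:
    sparse.hstack([counts, strings])
    object_stack_failed = False
except (TypeError, ValueError) as exc:
    object_stack_failed = True
    print(type(exc).__name__, exc)
```

The numeric column is accepted, while the object column is rejected before any joined matrix is produced.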

If I change your pipeline_2 to return the number of characters in each string, it starts working.

Note: you can also use ColumnTransformer from sklearn.compose.

import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion


class ColumnSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key, transform_function=None):
        self.key = key
        self.transform_function = transform_function

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        result = X[self.key]
        if self.transform_function:
            result = self.transform_function(result)
        return result


data = [
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'}
]
df = pd.DataFrame(data)



pipeline_1 = Pipeline([
    ('selector', ColumnSelector(key='col1')),
    ('vectorizer', CountVectorizer())
])

pipeline_2 = Pipeline([
    # Replace each string with its length so this step emits numeric values
    ('test', ColumnSelector(key='col2', transform_function=lambda x: [[len(i)] for i in x]))
])

feats = FeatureUnion([('count_vectorize', pipeline_1), ('original_column', pipeline_2)])

feats.fit_transform(df)
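
For the ColumnTransformer route mentioned in the note, a minimal sketch could look like the following. The choice of OneHotEncoder for col2 is an assumption made here for illustration (the answer itself only names ColumnTransformer). It is one way to turn the string column into numeric features so that it can be joined with the count matrix. Passing the raw strings through unchanged would likely run into the same dtype problem.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([
    {'col1': 'hello my friend', 'col2': 'somestring_'},
    {'col1': 'my friend', 'col2': 'somestring__'},
    {'col1': 'hello friend', 'col2': 'somestring___'},
])

features = ColumnTransformer([
    # A scalar column name hands CountVectorizer a 1-D column of documents.
    ('count_vectorize', CountVectorizer(), 'col1'),
    # A list of names hands OneHotEncoder a 2-D frame.
    # One-hot encoding is an assumption of this sketch, not part of the answer.
    ('original_column', OneHotEncoder(handle_unknown='ignore'), ['col2']),
])

combined = features.fit_transform(df)
print(combined.shape)
```

With this approach no custom ColumnSelector class is needed, because ColumnTransformer selects the columns itself.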
