
Sklearn Pipeline: all the input array dimensions for the concatenation axis must match exactly

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

tweet_text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    # (name, transformer, column(s))
    ('tweet', tweet_text_transformer, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

I don't understand where this error is coming from:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 2

I implemented your solution of converting the sparse matrix to an array, and it fixed the error. However, when I call predict I get another error:

model = pipeline.fit(X_train,y_train)
y_pred = model.predict(X_test)

It gives me this error:

ValueError: X has 574 features per sample; expecting 493

My understanding is that in this case it is not using the vectorizer fitted on the training data, but is fitting a new one on the X_test dataset. How can I fix that?

NOTE: The class-based solution needs an import for both BaseEstimator and TransformerMixin (from sklearn.base import BaseEstimator, TransformerMixin).

UPDATE:

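To fix this problem, use FunctionTransformer instead of defining a class. The feature-count mismatch arises when transform calls fit_transform, because the vectorizer is then refit on X_test at predict time and learns a different vocabulary.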


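from sklearn.preprocessing import FunctionTransformer

# Tuned for a realistically sized corpus; with the two-row toy data, use CountVectorizer() defaults
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

TweetTextProcessor = Pipeline(steps=[
    # axis=1 always returns a Series, even when X has a single row
    ("squeeze", FunctionTransformer(lambda x: x.squeeze(axis=1))),
    ("vect", CountVectorizer(**vectorizer_params)),
    ("tfidf", TfidfTransformer()),
    ("toarray", FunctionTransformer(lambda x: x.toarray())),
])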

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])
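As a rough check that this approach keeps the feature count stable between fit and predict, here is a minimal sketch on the toy data. The `test` frame and the `to_series` helper are made up for illustration, and the vectorizer uses its defaults because `min_df=5` cannot be satisfied by two documents. The dense conversion step is left out here since the ColumnTransformer can stack sparse and dense blocks itself.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.svm import LinearSVC

columns = ["Num1", "Num2", "Num3", "Text field", "Class"]
train = pd.DataFrame(
    [[1, 3, 4, "text", "pos"], [9, 3, 6, "text more", "neg"]], columns=columns
)
# Hypothetical unseen row whose words do not occur in the training text
test = pd.DataFrame([[2, 3, 5, "some new words", "pos"]], columns=columns)


def to_series(frame):
    # One-column DataFrame -> Series, so the vectorizer gets one document per row
    return frame.squeeze(axis=1)


text_steps = Pipeline(steps=[
    ("squeeze", FunctionTransformer(to_series)),
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
preprocessor = ColumnTransformer(transformers=[
    ("tweet", text_steps, ["Text field"]),
    ("numeric", MinMaxScaler(), ["Num1", "Num2", "Num3"]),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LinearSVC()),
])

feature_cols = ["Num1", "Num2", "Num3", "Text field"]
pipeline.fit(train[feature_cols], train["Class"])

# The fitted preprocessor reuses the training vocabulary on new data
fitted_pre = pipeline.named_steps["preprocessor"]
n_train = fitted_pre.transform(train[feature_cols]).shape[1]
n_test = fitted_pre.transform(test[feature_cols]).shape[1]
y_pred = pipeline.predict(test[feature_cols])
print(n_train, n_test)
```

Both counts should match because the vocabulary is learned once in `fit` and only applied in `predict`. A named function is used instead of a lambda so that the fitted pipeline can also be pickled.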

Reason

The issue is in the preprocessor. A ColumnTransformer stacks the output of tweet_text_transformer and the output of numeric_transformer horizontally. For that to work, both outputs must have the same number of rows (that is, the same size along axis 0).

Here they do not. Because the text column is selected with a list (['Text field']), tweet_text_transformer receives a one-column DataFrame rather than a 1-D Series. CountVectorizer iterates over its input to get the documents, and iterating over a DataFrame yields its column labels, not its rows. It therefore sees a single document, the string 'Text field', and returns a matrix with one row. numeric_transformer returns two rows, one per sample, so the stack fails with "the array at index 0 has size 1 and the array at index 1 has size 2". The sparse format of the CountVectorizer output is not the cause: a sparse matrix still has one row per document.
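The row counts involved can be inspected directly. The following is a small sketch on the toy text column only. It suggests that the mismatch comes from handing CountVectorizer a 2-D DataFrame, which it iterates by column label. The variable names are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"Text field": ["text", "text more"]})

# Iterating over a DataFrame yields its column labels, not its rows
docs_seen = list(df[["Text field"]])

# A 2-D selection is therefore treated as one document: the column name itself
shape_2d = CountVectorizer().fit_transform(df[["Text field"]]).shape

# A 1-D Series is iterated row by row, which is what CountVectorizer expects
shape_1d = CountVectorizer().fit_transform(df["Text field"]).shape

print(docs_seen)  # ['Text field']
print(shape_2d)   # (1, 2)
print(shape_1d)   # (2, 2)
```

The one-row result is what ends up next to the two-row output of the numeric branch.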

[Image omitted: output illustrating the explanation]

Solution

  • Create a custom transformer that converts the sparse matrix to a NumPy array
  • Since there is only one text column, squeeze the pandas DataFrame into a pandas Series so that each row reaches CountVectorizer as one document

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])



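class TweetTextProcessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tweet_text_transformer = Pipeline(steps=[
            ('count_vectoriser', CountVectorizer()),
            ('tfidf', TfidfTransformer())
        ])

    def fit(self, X, y=None):
        # Learn the vocabulary and IDF weights from the training data only
        self.tweet_text_transformer.fit(X.squeeze(axis=1))
        return self

    def transform(self, X, y=None):
        # Reuse the fitted vocabulary; calling fit_transform here would refit
        # on X_test and change the number of features at predict time
        return self.tweet_text_transformer.transform(X.squeeze(axis=1)).toarray()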
        




numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor(), ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

The above code should work. Let me know if it does not, or if the explanation was unclear.
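A possibly simpler variant, sketched here under the assumption that the text lives in a single column: ColumnTransformer accepts a scalar column name as well as a list, and a scalar string selects the column as a 1-D Series. The original pipeline then needs neither a squeeze step nor a custom class. The names `text_steps` and `features` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

text_steps = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

preprocessor = ColumnTransformer(transformers=[
    # A string (not a list) passes the column to the transformer as a 1-D Series
    ('tweet', text_steps, 'Text field'),
    # A list passes a 2-D DataFrame, which is what the scaler expects
    ('numeric', MinMaxScaler(), ['Num1', 'Num2', 'Num3']),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC()),
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

# Two text features ("text", "more") plus three numeric columns
features = pipeline.named_steps['preprocessor'].transform(X_train)
print(features.shape)  # (2, 5)
```

ColumnTransformer handles the mix of sparse and dense blocks itself, so no explicit toarray step is needed in this variant.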

