Sklearn Pipeline all the input array dimensions for the concatenation axis must match exactly

Question

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

tweet_text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    # (name, transformer, column(s))
    ('tweet', tweet_text_transformer, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

I don't understand where this error is coming from:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 2

Answer 1

I used your solution of converting the sparse matrix to an array, and it fixed the original error. However, a different error appears when I call predict:

model = pipeline.fit(X_train,y_train)
y_pred = model.predict(X_test)

This gives me:

ValueError: X has 574 features per sample; expecting 493

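My understanding is that, in this case, predict does not reuse the vectorizer fitted on the training data but fits a new one on X_test. That is what happens when transform calls fit_transform instead of reusing the fitted vectorizer: the vocabulary, and therefore the number of features, differs between training and prediction. I did not initially know how to fix this.

NOTE: The class-based solution also needs BaseEstimator and TransformerMixin imported from sklearn.base.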
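
The feature-count mismatch can be reproduced on a small scale. This is only a sketch with made-up documents (train_docs and test_docs are hypothetical, not the data from the question); it compares reusing a fitted CountVectorizer with fitting a fresh one on the test text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only to illustrate the effect.
train_docs = ['good movie', 'bad movie']
test_docs = ['good plot and good acting']

# Fit once on the training text, then only transform the test text.
vectorizer = CountVectorizer().fit(train_docs)
reused = vectorizer.transform(test_docs)

# Fitting again on the test text learns a different vocabulary.
refit = CountVectorizer().fit_transform(test_docs)

print(reused.shape[1], refit.shape[1])
```

The reused vectorizer keeps the number of columns learned from the training text, while the refitted one produces a column count that depends on the test text, which is what a downstream classifier rejects.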

UPDATE:

To fix this problem, use FunctionTransformer steps in a regular Pipeline instead of defining a class. The pipeline then fits the vectorizer once during fit and only calls transform at prediction time:

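from sklearn.preprocessing import FunctionTransformer

# These settings assume a larger corpus; min_df=5 is too strict for the two-row example.
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

TweetTextProcessor = Pipeline(steps=[
    # Turn the single-column DataFrame into a 1-D Series of documents.
    ("squeeze", FunctionTransformer(lambda x: x.squeeze(axis=1))),
    ("vect", CountVectorizer(**vectorizer_params)),
    ("tfidf", TfidfTransformer()),
    # Convert the sparse TF-IDF matrix to a dense array.
    ("toarray", FunctionTransformer(lambda x: x.toarray())),
])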

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

Answer 2

Reason

The issue is in the preprocessor. ColumnTransformer stacks the output of tweet_text_transformer and the output of numeric_transformer horizontally. For this to work, both outputs must have the same number of rows (the same size along axis 0).

That is not the case here. Because the column is selected with a list, ['Text field'], ColumnTransformer passes a 2-D DataFrame with one column to tweet_text_transformer. CountVectorizer expects a 1-D iterable of strings, and iterating over a DataFrame yields its column labels rather than its rows. The vectorizer therefore sees a single document, the string 'Text field', and returns a matrix with 1 row, while numeric_transformer returns 2 rows. This matches the error message: the array at index 0 has size 1 and the array at index 1 has size 2. The sparse output of CountVectorizer is not the cause; ColumnTransformer can stack sparse and dense outputs.
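
The row mismatch can be checked directly. The following is a minimal sketch that uses only the text column from the question (the variable names as_frame and as_series are mine) and compares what CountVectorizer returns for a one-column DataFrame and for a Series:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Text field': ['text', 'text more']})

# Selecting with a list keeps a 2-D DataFrame; iterating over it yields column labels.
as_frame = CountVectorizer().fit_transform(df[['Text field']])

# Selecting with a string gives a 1-D Series; iterating over it yields the documents.
as_series = CountVectorizer().fit_transform(df['Text field'])

print(as_frame.shape)   # one row, built from the column label 'Text field'
print(as_series.shape)  # one row per document
```

Only the Series version has as many rows as the numeric columns, so only that version can be stacked next to them.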

Solution

Create a custom transformer that squeezes the single-column pandas DataFrame into a pandas Series, so that CountVectorizer receives one document per row.
The transformer below also converts the sparse result to a NumPy array.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
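from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])


class TweetTextProcessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tweet_text_transformer = Pipeline(steps=[
            ('count_vectoriser', CountVectorizer()),
            ('tfidf', TfidfTransformer())
        ])

    def fit(self, X, y=None):
        # Learn the vocabulary and IDF weights from the training text only.
        # squeeze(axis=1) turns the one-column DataFrame into a Series,
        # even when it has a single row.
        self.tweet_text_transformer.fit(X.squeeze(axis=1))
        return self

    def transform(self, X, y=None):
        # Reuse the fitted vectorizer; calling fit_transform here would
        # rebuild the vocabulary on every call, including at predict time.
        return self.tweet_text_transformer.transform(X.squeeze(axis=1)).toarray()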

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor(), ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

The above code should work. Let me know if it does not, or if the explanation is unclear.
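
As an alternative sketch that avoids a custom class, the text column can be selected with a plain string instead of a list. ColumnTransformer then passes a 1-D Series to the text pipeline, which is the input CountVectorizer expects. The extra row X_new is made up for illustration, and the step names follow the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

data = pd.DataFrame(
    [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']],
    columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'],
)

text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

preprocessor = ColumnTransformer(transformers=[
    # A string (not a list) selects the column as a 1-D Series.
    ('tweet', text_transformer, 'Text field'),
    # A list selects a 2-D DataFrame, which is what the scaler needs.
    ('numeric', MinMaxScaler(), ['Num1', 'Num2', 'Num3']),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC()),
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

# Hypothetical unseen row: the fitted vocabulary is reused, not rebuilt.
X_new = pd.DataFrame([[2, 3, 5, 'some unseen text']], columns=X_train.columns)
n_features = pipeline.named_steps['preprocessor'].transform(X_new).shape[1]
prediction = pipeline.predict(X_new)
print(n_features, prediction)
```

Because the vectorizer is an ordinary pipeline step here, it is fitted once in fit and only applied in predict, so the feature count stays the same between training and prediction.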
