When I fit a scikit-learn model with one more feature, I get the error "ValueError: Found input variables with inconsistent numbers of samples"

This code works fine:

    df_amazon = pd.read_csv ("datasets/amazon_alexa.tsv", sep="\t")

    X = df_amazon['variation'] # the features we want to analyze
    ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

    X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

    # Create pipeline using Bag of Words
    pipe = Pipeline([('cleaner', predictors()),
                     ('vectorizer', bow_vector),
                     ('classifier', classifier)])

    pipe.fit(X_train,y_train)

But if I try to add one more feature to the model, replacing

    X = df_amazon['variation']

by

    X = df_amazon[['variation','verified_reviews']] 

I get this error message from scikit-learn when I call fit:

    ValueError: Found input variables with inconsistent numbers of samples: [2, 2205]

So fit works when X_train and y_train have the shapes (2205,) and (2205,).

But not when the shapes are changed to (2205, 2) and (2205,).

What's the best way to deal with that?

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame(
        data=[
            ['Heather Gray Fabric', 'I received the echo as a gift.', 1],
            ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0],
        ],
        columns=['variation', 'review', 'feedback'],
    )

    vect = CountVectorizer()
    # Iterating over a two-column DataFrame yields its column names, not its rows
    vect.fit_transform(df[['variation', 'review']])

    # Now look at the vocabulary that has been created
    print(vect.vocabulary_)

Output, where features have been generated only for the column names and not for the content of the columns:

    {'variation': 1, 'review': 0}

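So you need to make a single column that contains both variation and review, and pass that column into your model. Join the two with a space, otherwise the last word of one fuses with the first word of the other (giving tokens such as 'fabrici' and 'fabricwithout'):

    df['variation_review'] = df['variation'] + ' ' + df['review']

    vect.fit_transform(df['variation_review'])
    print(vect.vocabulary_)

Output:

    {'heather': 9,
     'gray': 7,
     'fabric': 4,
     'received': 13,
     'the': 15,
     'echo': 3,
     'as': 0,
     'gift': 6,
     'sandstone': 14,
     'without': 17,
     'having': 8,
     'cellphone': 2,
     'cannot': 1,
     'use': 16,
     'many': 11,
     'of': 12,
     'her': 10,
     'features': 5}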
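
If the two columns should keep separate vocabularies instead of being merged into one string, scikit-learn's ColumnTransformer can apply one CountVectorizer per column inside the pipeline. This is only a sketch under some assumptions: it reuses the two-row toy frame from this answer (in the question the second column is called 'verified_reviews'), it leaves out the question's custom 'cleaner' step, and LogisticRegression is a placeholder for whatever classifier the question uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data taken from the answer above
df = pd.DataFrame(
    data=[
        ['Heather Gray Fabric', 'I received the echo as a gift.', 1],
        ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0],
    ],
    columns=['variation', 'review', 'feedback'],
)

# One CountVectorizer per text column. Each column is named with a plain
# string (not a list) so the vectorizer receives a 1-D Series of documents.
preprocess = ColumnTransformer([
    ('variation_bow', CountVectorizer(), 'variation'),
    ('review_bow', CountVectorizer(), 'review'),
])

pipe = Pipeline([
    ('vectorizer', preprocess),
    ('classifier', LogisticRegression()),  # placeholder classifier (assumption)
])

X = df[['variation', 'review']]  # shape (n_samples, 2)
y = df['feedback']
pipe.fit(X, y)

# The fitted transformer now produces one row per sample
features = pipe.named_steps['vectorizer'].transform(X)
print(features.shape)
```

The string column selector matters here: passing ['variation'] as a list would hand the vectorizer a one-column DataFrame, which is the same problem as in the question.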

The data must have the shape (n_samples, n_features). Try transposing X (X.T).
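
Before transposing, it may help to check which layout X already has. A small sketch with a three-row frame (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical three-row frame with the same two text columns
X = pd.DataFrame({
    'variation': ['Heather Gray Fabric', 'Sandstone Fabric', 'Charcoal Fabric'],
    'review': ['I received the echo as a gift.', 'Works well.', 'Sound is fine.'],
})

print(X.shape)    # (3, 2): rows are samples, columns are features
print(X.T.shape)  # (2, 3): transposing swaps the two axes
print(list(X))    # ['variation', 'review']: iteration yields column names
```

In the question, X_train is reported as (2205, 2), which is already (n_samples, n_features). The 2 in the error message matches the number of column names a text vectorizer sees when it iterates over the DataFrame.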
