
ValueError while using scikit-learn: number of features of the model doesn't match that of the input

I am working on a classification problem using RandomForestClassifier. In the code, I split the dataset into train and test sets to make predictions.

Here's the code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
import numpy as np
from numpy import genfromtxt, savetxt

a = (np.genfromtxt(open('filepath.csv','r'), delimiter=',', dtype='int')[1:])
a_train, a_test = train_test_split(a, test_size=0.33, random_state=0)


def main():
    target = [x[0] for x in a_train]
    train = [x[1:] for x in a_train]

    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(a_test))]

    savetxt('filepath.csv', predicted_probs, delimiter=',', fmt='%d,%f', 
            header='Id,PredictedProbability', comments = '')

if __name__=="__main__":
    main()

On execution, however, I'm getting the following error:

ValueError: Number of features of the model must match the input. Model n_features is 1434 and input n_features is 1435

Any suggestions as to how I should proceed? Thanks.

It looks like you are using a_test directly, without stripping out the output feature.

The model was fit on 1434 input features, but each row of a_test still has the target in its first column, so you are feeding it 1435 values per row: the 1434 features plus the output feature.

You can fix this by slicing a_test the same way you sliced a_train:

test = [x[1:] for x in a_test]

Then use test on the following line:

predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(test))]
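
One way to keep the two splits consistent is to run both through the same helper, so the target column is always removed in the same place. The sketch below is only an illustration: the helper name split_target_features and the tiny sample rows are made up, and it assumes the target sits in column 0, as in the question.

```python
def split_target_features(rows):
    """Separate the target (column 0) from the features (remaining columns)."""
    targets = [row[0] for row in rows]
    features = [list(row[1:]) for row in rows]
    return targets, features


# Made-up rows: the first value is the label, the other three are features.
a_train = [[1, 10, 20, 30], [0, 11, 21, 31]]
a_test = [[1, 12, 22, 32]]

target, train = split_target_features(a_train)
_, test = split_target_features(a_test)

# Both feature matrices must have the same width before fitting and predicting.
n_train_features = len(train[0])
n_test_features = len(test[0])
assert n_train_features == n_test_features
```

With real data, train and target would go to rf.fit(train, target) and test to rf.predict_proba(test).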
