
Random Forest Classifier Batch Learning Python Dimension Error

I have a large dataframe with around a million records and 19 features (plus one target variable). Since I was unable to train my RF classifier in one go due to a memory error (it is a multi-class classification problem with around 750 classes), I resorted to batch learning. The model trains fine, but when I run model.predict, it raises the following ValueError:

ValueError: operands could not be broadcast together with shapes (231106,628) (231106,620) (231106,628).

My code is as follows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Split into independent (X) and dependent (y) variables
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Train-test split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=1234)

# Six batches of features and labels (wrapped in list() so they can be re-used on
# every pass; a bare zip is a one-shot iterator and would be exhausted after the first pass)
data_splits = list(zip(np.array_split(train_X, 6), np.array_split(train_y, 6)))

rf_clf = RandomForestClassifier(warm_start=True, n_estimators=1, criterion='entropy', random_state=1234)

for i in range(10):  # 10 passes through the data
    for X, y in data_splits:
        rf_clf.fit(X, y)
        rf_clf.n_estimators += 1  # increment by one, so the next fit adds one more tree

y_preds = rf_clf.predict(test_X)

I would be highly grateful for any help. Any other suggestions are also welcome.

Found the answer. The error was caused by inconsistent sets of y classes across the data batches: with warm_start=True, every call to fit records the classes present in that particular batch, so trees grown on batches that are missing some of the ~750 classes produce probability arrays with fewer columns (620 vs. 628 in the traceback above), and predict cannot combine them.
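
For anyone hitting the same error, here is a minimal sketch (not from the original post) of one way to keep the class set identical across batches: build the batches with scikit-learn's StratifiedKFold instead of np.array_split, so that every batch contains samples of every class. This assumes each class has at least as many training samples as there are batches (6 here); train_X, train_y and test_X are the variables from the code above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

n_batches = 6
skf = StratifiedKFold(n_splits=n_batches, shuffle=True, random_state=1234)

# Each fold's held-out indices form one stratified batch: the 6 batches are
# disjoint, together cover all of train_X, and (under the assumption above)
# each one contains samples from every class.
batches = [(train_X.iloc[idx], train_y.iloc[idx])
           for _, idx in skf.split(train_X, train_y)]

# Sanity check: every batch should expose the same set of classes
assert len({frozenset(np.unique(y_b)) for _, y_b in batches}) == 1

rf_clf = RandomForestClassifier(warm_start=True, n_estimators=1, criterion='entropy', random_state=1234)

for i in range(10):  # 10 passes through the data
    for X_b, y_b in batches:
        rf_clf.fit(X_b, y_b)
        rf_clf.n_estimators += 1  # the next fit adds one more tree

y_preds = rf_clf.predict(test_X)

If some class has fewer samples than the number of batches, StratifiedKFold only warns and that class can still be missing from a batch; in that case reduce the number of batches or handle the rare classes separately.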
