Randomforest cross validation: TypeError: 'KFold' object is not iterable

Question

Hi, I am trying to fit a random forest on SMOTE-oversampled data, but I get an error when I add cross-validation and an ROC curve. The data is a pandas DataFrame in which school is the binary target (0 or 1) I want to predict.

import matplotlib.pyplot as plt
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import KFold, train_test_split

y = df.school.values
X = df.drop(['school'],axis=1)

oversample = SMOTE()
over_X, over_y = oversample.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(over_X, over_y, test_size=0.3, stratify=over_y)
kf = KFold(n_splits=10)

#ROC
tprs = []
base_fpr = np.linspace(0, 1, 101)

plt.figure(figsize=(5, 5))
plt.axes().set_aspect('equal', 'datalim')

for i, (train, test) in enumerate(kf):
    model = RandomForestClassifier(n_estimators=200,random_state=23).fit(over_X[train], over_y[train])
    y_score = model.predict_proba(over_X[test])
    fpr, tpr, _ = roc_curve(over_y[test], y_score[:, 1])
    
    plt.plot(fpr, tpr, 'b', alpha=0.15)
    tpr = np.interp(base_fpr, fpr, tpr)
    tpr[0] = 0.0
    tprs.append(tpr)

tprs = np.array(tprs)
mean_tprs = tprs.mean(axis=0)
std = tprs.std(axis=0)

tprs_upper = np.minimum(mean_tprs + std, 1)
tprs_lower = mean_tprs - std


plt.plot(base_fpr, mean_tprs, 'b')
plt.fill_between(base_fpr, tprs_lower, tprs_upper, color='grey', alpha=0.3)

plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
     21 plt.axes().set_aspect('equal', 'datalim')
     22 
---> 23 for i, (train, test) in enumerate(kf):
     24     model = RandomForestClassifier(n_estimators=200,random_state=23).fit(over_X[train], over_y[train])
     25     y_score = model.predict_proba(over_X[test])

TypeError: 'KFold' object is not iterable

Does anyone know what's wrong in my code?

PS: I also tried the following import, but it does not work either.

from sklearn.cross_validation import KFold

Answer 1

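Call split on the KFold object; it yields a pair of index arrays for each fold. The test rows of a fold are left out of that fold's training rows, so each fold is tested on data the model has never seen before.

from sklearn.model_selection import KFold

kdf = KFold(n_splits=5, shuffle=True, random_state=123)

# train is assumed to be a pandas DataFrame; split() yields positional row indices
for train_index, test_index in kdf.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]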
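
As a quick check of that claim, here is a minimal sketch that runs the same loop on a small made-up DataFrame (the column names and values are assumptions for illustration only). It confirms that the training and test rows of every fold are disjoint and that every row is tested exactly once.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data standing in for the `train` DataFrame of the answer.
train = pd.DataFrame({"feature": np.arange(20), "school": [0, 1] * 10})

kdf = KFold(n_splits=5, shuffle=True, random_state=123)

tested_rows = []
for train_index, test_index in kdf.split(train):
    # split() yields positional indices, so rows are selected with .iloc
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    # No row appears in both the training and the test part of a fold.
    assert set(cv_train.index).isdisjoint(cv_test.index)
    tested_rows.extend(cv_test.index)

# Across the 5 folds, every row is used for testing exactly once.
all_rows_tested_once = sorted(tested_rows) == list(range(20))
print(all_rows_tested_once)
```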

Answer 2

By calling enumerate(kf) you are assuming that kf is iterable, which it isn't. Scikit-learn's cross-validation splitters (such as KFold) have a split method that returns an iterable yielding one pair of train and test index arrays per fold.

Try the following:

for i, (train, test) in enumerate(kf.split(over_X)):
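
A minimal sketch reproducing the error and the fix on a small NumPy array (the array is made up for illustration). One caveat that goes beyond the error itself: if over_X is a pandas DataFrame, which is what recent imbalanced-learn versions return when SMOTE is given a DataFrame, then over_X[train] selects columns rather than rows, so the rows would likely need to be selected with over_X.iloc[train] instead.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # made-up data: 10 rows, 2 features
kf = KFold(n_splits=5)

# The splitter object itself is not iterable, so enumerate(kf) fails.
try:
    enumerate(kf)
    raised = False
except TypeError:
    raised = True

# split() yields one (train indices, test indices) pair per fold.
fold_sizes = []
for i, (train, test) in enumerate(kf.split(X_demo)):
    # Integer index arrays select rows of a NumPy array directly.
    fold_sizes.append((len(X_demo[train]), len(X_demo[test])))

print(raised, fold_sizes)
```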

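Additionally, the SMOTE oversampling should be done inside the cross-validation loop to avoid data leakage. If you oversample first, synthetic samples interpolated from validation rows end up in the training folds. SMOTE is a sampler rather than a transformer, so in each fold you should call fit_resample on the training portion only and leave the validation portion untouched.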
