Hi, I am trying to fit a random forest on SMOTE-oversampled data, but I get an error when I add cross-validation and a ROC curve. My data is a pandas DataFrame in which school is the binary target (0 or 1) that I want to predict.
import matplotlib.pyplot as plt
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import KFold, train_test_split
y = df.school.values
X = df.drop(['school'],axis=1)
oversample = SMOTE()
over_X, over_y = oversample.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(over_X, over_y, test_size=0.3, stratify=over_y)
kf = KFold(n_splits=10)
#ROC
tprs = []
base_fpr = np.linspace(0, 1, 101)
plt.figure(figsize=(5, 5))
plt.axes().set_aspect('equal', 'datalim')
for i, (train, test) in enumerate(kf):
    model = RandomForestClassifier(n_estimators=200,random_state=23).fit(over_X[train], over_y[train])
    y_score = model.predict_proba(over_X[test])
    fpr, tpr, _ = roc_curve(over_y[test], y_score[:, 1])
    plt.plot(fpr, tpr, 'b', alpha=0.15)
    tpr = np.interp(base_fpr, fpr, tpr)
    tpr[0] = 0.0
    tprs.append(tpr)
tprs = np.array(tprs)
mean_tprs = tprs.mean(axis=0)
std = tprs.std(axis=0)
tprs_upper = np.minimum(mean_tprs + std, 1)
tprs_lower = mean_tprs - std
plt.plot(base_fpr, mean_tprs, 'b')
plt.fill_between(base_fpr, tprs_lower, tprs_upper, color='grey', alpha=0.3)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
21 plt.axes().set_aspect('equal', 'datalim')
22
---> 23 for i, (train, test) in enumerate(kf):
24 model = RandomForestClassifier(n_estimators=200,random_state=23).fit(over_X[train], over_y[train])
25 y_score = model.predict_proba(over_X[test])
TypeError: 'KFold' object is not iterable
Does anyone know what's wrong in my code?
PS: I also tried this import, but it does not work either:
from sklearn.cross_validation import KFold
With this approach, each fold is tested on data the model has never seen before:
from sklearn.model_selection import KFold
kdf = KFold(n_splits=5, shuffle=True, random_state=123)
for train_index, test_index in kdf.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
By calling enumerate(kf) you are assuming kf is an iterable, which it isn't. Scikit-learn's cross-validation schemes (such as KFold) have a split method that returns an iterable yielding the train and test indices of each split.
Try the following:
for i, (train, test) in enumerate(kf.split(over_X)):
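To see what split yields, here is a minimal sketch on a toy NumPy array (the array X and the list fold_sizes are made up for illustration and are not from the question). Each iteration gives two arrays of integer positions. Note that if over_X is a pandas DataFrame, as fit_resample typically returns when given one, those positions would need to go through over_X.iloc[train] rather than over_X[train].

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data standing in for over_X (assumption): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)

fold_sizes = []
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    # train_idx and test_idx are arrays of row positions, not the rows themselves.
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(i, train_idx, test_idx)
```

With 10 samples and 5 splits, every fold holds out 2 rows and trains on the other 8.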
Additionally, the SMOTE oversampling should be done inside the cross-validation loop to avoid data leakage. In more detail, call fit_resample on the training portion of each fold only and leave the validation portion untouched, so that the model is evaluated on real samples rather than synthetic ones.