I am trying to use KFold in scikit-learn and really trying to understand what it's doing. I am reading Python Machine Learning, 3rd edition, by Sebastian Raschka.
In chapter 6 https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch06/ch06.ipynb
He has code for StratifiedKFold:
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,
          np.bincount(y_train[train]), score))

print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
Looking at the KFold cross-validation documentation https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html, this time they split the whole dataset, and pass only X to split().
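To see what split() actually yields, here is a minimal sketch (the toy array is an assumption, just for illustration) showing that KFold.split only needs X and produces arrays of row indices:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy array: KFold partitions row *indices*, so it never needs y.
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print(train_idx, test_idx)  # integer row positions, not rows of X
```

Each iteration yields one (train, test) pair of index arrays; the test folds together cover every row exactly once.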
So I am trying to use 10-fold cross-validation; below is my code:
gbr_onehot = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    random_state=0
)

kfold = KFold(n_splits=10, shuffle=True, random_state=0).split(X)
train_score = []
test_score = []
for k, (train, test) in enumerate(kfold):
    gbr_onehot.fit(X[train], y[train])
    train_pred = gbr_onehot.predict(X[train])
    train_score.append(metrics.mean_squared_error(train_pred, y[train]))
    test_pred = gbr_onehot.predict(X[test])
    test_score.append(metrics.mean_squared_error(test_pred, y[test]))
which gives me

KeyError: "None of [Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     9,
               10,
            ...
            18313, 18314, 18315, 18316, 18317, 18318, 18319, 18320, 18321,
            18322],
           dtype='int64', length=16490)] are in the [columns]"
I've been using cross_val_score, but I want to get the training set's MSE as well.
I've read many SO questions and other resources, but I am still confused.
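(For what it's worth, scikit-learn's cross_validate, unlike cross_val_score, can also report training-fold scores via return_train_score=True; a sketch, with make_regression standing in for the real X and y:)

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Hypothetical data standing in for the real X, y.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

cv_results = cross_validate(
    GradientBoostingRegressor(random_state=0), X, y,
    cv=10, scoring='neg_mean_squared_error', return_train_score=True)

# The scorer returns *negative* MSE, so negate to recover MSE.
train_mse = -cv_results['train_score']
test_mse = -cv_results['test_score']
print('train MSE: %.3f, test MSE: %.3f' % (train_mse.mean(), test_mse.mean()))
```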
I am not sure why, when he instantiates kfold, he uses only the train set. Why not the whole set?
The proper way is to leave out a sample to test whether your model generalizes well. KFold, however, is for averaging the model's performance during training: each split is held out once for validation while the model trains on the rest, and the errors are averaged, so every split is used to verify in-sample performance. The split you hold out beforehand (say 0.25 of the overall dataset, which is not included in the KFold) is then used to test out-of-sample performance (for generalization purposes).
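A sketch of that pattern on toy data (the LinearRegression model and the sizes here are illustrative assumptions): hold out a test split first, run KFold only on the training portion, and touch the held-out split once at the end:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.rand(100)

# 1) Hold out 25% for the final out-of-sample check; it never enters KFold.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2) Average in-sample performance over the training portion only.
model = LinearRegression()
cv_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_train):
    model.fit(X_train[train_idx], y_train[train_idx])
    cv_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
print('CV R^2: %.3f' % np.mean(cv_scores))

# 3) Only now use the held-out split to judge generalization.
print('Test R^2: %.3f' % model.score(X_test, y_test))
```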
Why do they split the whole dataset this time, and pass only X?
Because they are giving an example of how to use the function, for the sake of illustration and nothing else. If they held out one split for out-of-sample testing, it might confuse the reader about the inner workings of KFold.
The Error
The KeyError suggests that X is a pandas DataFrame: X[train] treats the integer indices returned by KFold.split as column labels, and none of them exist among the columns. Index by position instead, e.g. X.iloc[train] and y.iloc[train] (or convert with X.values before splitting).
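For instance, if the data is in pandas, a sketch of the loop with position-based indexing (make_regression here is a hypothetical stand-in for the real dataset, and n_estimators is reduced for speed):

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

# Hypothetical stand-ins for the real data, wrapped as pandas objects.
Xa, ya = make_regression(n_samples=100, n_features=4, random_state=0)
X, y = pd.DataFrame(Xa), pd.Series(ya)

gbr_onehot = GradientBoostingRegressor(n_estimators=100, random_state=0)
train_score, test_score = [], []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # .iloc indexes by position; plain X[train] would look up *columns*.
    gbr_onehot.fit(X.iloc[train], y.iloc[train])
    train_score.append(metrics.mean_squared_error(
        y.iloc[train], gbr_onehot.predict(X.iloc[train])))
    test_score.append(metrics.mean_squared_error(
        y.iloc[test], gbr_onehot.predict(X.iloc[test])))
```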