How to really understand KFold cross validation in sklearn

I am trying to use KFold in sklearn and really trying to understand what it is doing. I am reading Python Machine Learning, 3rd edition, by Sebastian Raschka.

In chapter 6 https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch06/ch06.ipynb

He has the following code for StratifiedKFold:

kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,
          np.bincount(y_train[train]), score))

print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

  1. Why does he instantiate kfold using only the training set? Why not the whole dataset?

Looking at the KFold cross validation documentation https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html, this time they split the whole dataset, but pass only X.

  2. Why does the documentation split the whole dataset this time, but pass only X?
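
For reference, the documentation example looks roughly like this (a minimal sketch; the toy X array is my own). split() only needs X because it returns index arrays computed from the number of rows:

import numpy as np
from sklearn.model_selection import KFold

# Toy data, just to show what split() yields
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    # split() returns integer index arrays, not the data itself
    print("TRAIN:", train_index, "TEST:", test_index)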

So I am trying to use 10-fold cross validation; below is my code:

gbr_onehot = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    random_state=0
)

kfold = KFold(n_splits=10, shuffle=True, random_state=0).split(X)

train_score = []
test_score  = []

for k, (train, test) in enumerate(kfold):
    gbr_onehot.fit(X[train], y[train])

    train_pred = gbr_onehot.predict(X[train])
    train_score.append(metrics.mean_squared_error(train_pred, y[train]))

    test_pred  = gbr_onehot.predict(X[test])
    test_score.append(metrics.mean_squared_error(test_pred, y[test]))

which gives me:

KeyError: "None of [Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 9, 10, ... 18313, 18314, 18315, 18316, 18317, 18318, 18319, 18320, 18321, 18322], dtype='int64', length=16490)] are in the [columns]"

I've been using cross_val_score, but I also want the training set's MSE.

I've read many SO questions and other resources, but I am still confused.

Why does he instantiate kfold using only the training set? Why not the whole dataset?

The proper way is to hold out a sample to test whether your model generalizes well. KFold, by contrast, averages the performance of the model during training: each split is held out once while the model is fit on the remaining splits, and the scores are averaged to estimate the in-sample performance. The held-out sample (say 0.25 of the overall dataset, which is not included in the KFold) is then used to test out-of-sample, for generalization purposes.
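
To make the pattern concrete, here is a minimal sketch of it (the dataset and estimator are placeholders of my own, not from the book):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data; it never enters the cross-validation loop
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=5000)

# Cross-validate on the training portion only, as in the book's code
scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X_train, y_train):
    clf.fit(X_train[train_idx], y_train[train_idx])
    scores.append(clf.score(X_train[val_idx], y_train[val_idx]))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# Only after model selection: refit on all training data,
# then test once on the untouched holdout
clf.fit(X_train, y_train)
print('Holdout accuracy: %.3f' % clf.score(X_test, y_test))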

Why does the documentation split the whole dataset this time, but pass only X?

Because they are giving an example of how to use the function, nothing more. If they held out one split for out-of-sample testing, it might confuse the reader about the inner workings of KFold. Note also that plain KFold splits purely by row position, so its split only needs X; it accepts y but ignores it (unlike StratifiedKFold, which needs y to preserve class proportions).
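
You can verify that plain KFold ignores y with a quick sketch (the toy arrays are my own):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)
y = np.array([0, 0, 1, 1])

kf = KFold(n_splits=2)
# Passing y is allowed but has no effect on plain KFold: the folds are identical
for (tr_a, te_a), (tr_b, te_b) in zip(kf.split(X), kf.split(X, y)):
    assert np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)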

The Error

The error message itself points to the cause: this KeyError is what pandas raises when a DataFrame is indexed with an array of integers, because X[train] looks those integers up as column labels rather than row positions. So your X is most likely a DataFrame, not a NumPy array. Use X.iloc[train] (and y.iloc[train] if y is a Series), or convert with X.to_numpy() before splitting.
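
Here is a corrected sketch of your loop under that assumption (X a pandas DataFrame, y a Series, gbr_onehot as defined in your question):

from sklearn import metrics
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=0).split(X)

train_score = []
test_score  = []

for train, test in kfold:
    # .iloc selects rows by position; plain X[train] looks up column labels,
    # which is exactly what raised the KeyError
    gbr_onehot.fit(X.iloc[train], y.iloc[train])

    train_pred = gbr_onehot.predict(X.iloc[train])
    train_score.append(metrics.mean_squared_error(y.iloc[train], train_pred))

    test_pred = gbr_onehot.predict(X.iloc[test])
    test_score.append(metrics.mean_squared_error(y.iloc[test], test_pred))

Alternatively, cross_validate(gbr_onehot, X, y, cv=10, scoring='neg_mean_squared_error', return_train_score=True) returns both train and test scores in one call, which covers the train-set MSE that cross_val_score does not report.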
