
Cross validation: cross_val_score function from scikit-learn arguments

According to the scikit-learn documentation:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

X and y

X : array-like The data to fit. Can be for example a list, or an array.

y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.

I am wondering whether [X, y] is X_train and y_train, or whether [X, y] should be the whole dataset. In some Kaggle notebooks people use the whole dataset, while in others they use X_train and y_train.

To my knowledge, cross-validation just evaluates the model and shows whether or not you are overfitting/underfitting your data (it does not actually train the model). So, in my view, the more data you have, the better the performance estimate will be, which is why I would use the whole dataset.

What do you think?

Model performance depends on the way the data is split, and sometimes the model does not have the ability to generalize.

That is why we need cross-validation.

Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset.

[X, y] should be the whole dataset, because internally cross-validation splits the data into training data and test data.
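For example, a minimal sketch (assuming a LogisticRegression classifier on the toy iris dataset, which is not part of the original question) where the whole X and y are passed directly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load the full dataset -- no manual train/test split is needed here,
# because cross_val_score splits the data into folds internally.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
print(scores.mean())
```

Note that the model passed in is cloned and refit for each fold, so `model` itself is left unfitted after the call.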

Suppose you use cross validation with 5 folds (cv = 5).

We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set, and compute the metric of interest.

Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set, and compute the metric of interest. This repeats until each of the five folds has served as the test set once.
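The procedure above can be sketched manually with KFold (again assuming LogisticRegression on the iris dataset, which the original answer does not specify):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Hold out one fold as the test set, fit on the remaining four folds.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Predict on the held-out fold and compute the metric of interest.
    scores.append(model.score(X[test_idx], y[test_idx]))

print(scores)  # five per-fold accuracy scores
```

This loop is essentially what `cross_val_score(model, X, y, cv=5)` does for you in one call.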


By default, scikit-learn's cross_val_score() function uses the R^2 score as the metric of choice for regression.

The R^2 score is also known as the coefficient of determination.
