
Cross validation: cross_val_score function from scikit-learn arguments

According to the scikit-learn documentation:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

X and y

X : array-like The data to fit. Can be for example a list, or an array.

y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.

I am wondering whether [X, y] is X_train and y_train, or whether [X, y] should be the whole dataset. In some Kaggle notebooks people use the whole dataset, while in others they use X_train and y_train.

To my knowledge, cross-validation just evaluates the model and shows whether or not you are overfitting/underfitting your data (it does not actually train the model). So, in my view, the more data you have, the better the performance estimate will be, which is why I would use the whole dataset.

What do you think?

Model performance depends on the way the data is split, and sometimes the model does not have the ability to generalize.

That is why we need cross-validation.

Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data.

I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset.

[X, y] should be the whole dataset, because internally cross-validation splits the data into training data and test data.
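For example, a minimal sketch (assuming a LogisticRegression classifier on the toy iris dataset, which is not part of the original question) where the whole X and y are passed directly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load the full dataset -- no manual train/test split is needed here,
# because cross_val_score splits the data into folds internally.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
print(scores.mean())
```

Note that the model passed in is cloned and refit for each fold, so `model` itself is left unfitted after the call.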

Suppose you use cross validation with 5 folds (cv = 5).

We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set, and compute the metric of interest.

Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set, and compute the metric of interest. This repeats until each of the five folds has served as the test set once.
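The procedure above can be sketched manually with KFold (again assuming LogisticRegression on the iris dataset, which the original answer does not specify):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Hold out one fold as the test set, fit on the remaining four folds.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Predict on the held-out fold and compute the metric of interest.
    scores.append(model.score(X[test_idx], y[test_idx]))

print(scores)  # five per-fold accuracy scores
```

This loop is essentially what `cross_val_score(model, X, y, cv=5)` does for you in one call.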


By default, scikit-learn's cross_val_score() function uses the R^2 score as the metric of choice for regression.

The R^2 score is also known as the coefficient of determination.
