Cross validation: arguments of the cross_val_score function from scikit-learn

Original post: 2018-05-04 14:05:25. Tags: python, machine-learning, scikit-learn, cross-validation, data-fitting

According to the scikit-learn documentation:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

X and y

X : array-like. The data to fit. Can be, for example, a list or an array.

y : array-like, optional, default: None. The target variable to try to predict in the case of supervised learning.

I am wondering whether [X, y] is X_train and y_train, or whether [X, y] should be the whole dataset. In some Kaggle notebooks people pass the whole dataset, and in others they pass X_train and y_train.

To my knowledge, cross validation just evaluates the model and shows whether or not you overfit/underfit your data (it does not actually train the model). So, in my view, the more data you have, the better the performance will be, and I would use the whole dataset.

What do you think?

1 answer

Model performance depends on the way the data is split, and with a single split you cannot always tell whether the model generalizes.

That is why we need cross validation.

Cross-validation is a vital step in evaluating a model. It makes the most of the available data: over the course of the procedure, every sample is used both to train the model and to test it.
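To make that last point concrete, here is a small sketch that counts how often each sample lands in a training fold and in a test fold. The 10-sample toy array is made up for illustration, and KFold with 5 splits is assumed as the splitter.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature (illustrative only, not from the question)
X = np.arange(10).reshape(-1, 1)

train_counts = np.zeros(len(X), dtype=int)
test_counts = np.zeros(len(X), dtype=int)

# KFold yields (train indices, test indices) for each of the 5 splits
for train_idx, test_idx in KFold(n_splits=5).split(X):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print(test_counts)   # each sample is in a test fold exactly once
print(train_counts)  # each sample is in a training fold 4 times
```

So no sample is wasted: each one is tested on once and trained on in the other four rounds.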

"I am wondering whether [X, y] is X_train and y_train or [X, y] should be the whole dataset."

[X, y] should be the whole dataset, because cross validation internally splits the data into training data and test data.
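As a rough sketch of what such a call looks like, the snippet below passes a full dataset to cross_val_score and lets it do the splitting. The synthetic data from make_regression and the choice of LinearRegression are assumptions for illustration, not something from the question.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for "the whole dataset" (assumption)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# No manual train/test split here: cross_val_score splits X and y internally
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(scores.shape)   # one score per fold
print(scores.mean())  # average score across the folds
```

The returned array has one entry per fold, and people usually report its mean (and often its standard deviation).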

Suppose you use cross validation with 5 folds (cv = 5).

We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest.

Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest. This is repeated until each of the five folds has served as the test set once, which gives five scores.
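The steps above can be written out by hand roughly as follows. This is only a sketch under some assumptions: synthetic data, LinearRegression as the model, unshuffled KFold as the splitter, and R^2 as the metric of interest. With those choices the manual loop should agree with what cross_val_score returns for a regressor with cv=5.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data and model chosen for illustration (assumptions)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
model = LinearRegression()

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    fold_model = clone(model)                    # fresh, unfitted copy for this fold
    fold_model.fit(X[train_idx], y[train_idx])   # fit on the four remaining folds
    y_pred = fold_model.predict(X[test_idx])     # predict on the held-out fold
    manual_scores.append(r2_score(y[test_idx], y_pred))  # metric of interest

# Same procedure done by the library
library_scores = cross_val_score(model, X, y, cv=5)

print(np.allclose(manual_scores, library_scores))
```

Note that a new model is fitted in every round, so cross validation does train models; it just does not leave you with one final fitted model.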

By default (scoring=None), scikit-learn's cross_val_score() function uses the estimator's own score method, which for regressors is the R^2 score.

The R^2 score is also called the coefficient of determination.
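For reference, R^2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does that on four made-up values and compares the result with sklearn.metrics.r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions (illustrative only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)                 # about 0.949
print(r2_score(y_true, y_pred))  # same value from scikit-learn
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean of y, and it can go negative for a model that does worse than that.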