
Cross validation: cross_val_score function from scikit-learn arguments

According to the scikit-learn documentation:

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

X and y

X : array-like
    The data to fit. Can be for example a list, or an array.

y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
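For reference, a minimal sketch of a call using the X and y arguments described above. The synthetic data from make_regression and the choice of LinearRegression are assumptions made only for illustration; they are not part of the question.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset (illustrative assumption).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()

# X and y are passed as a whole; the splitting into folds happens inside.
scores = cross_val_score(model, X, y, cv=5)

print(scores.shape)   # one score per fold
print(scores.mean())  # average score across the folds
```

Note that cross_val_score fits copies of the estimator, one per fold, so the model object passed in is still unfitted after the call.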

I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset. In some Kaggle notebooks people use the whole dataset, and in others X_train and y_train.

To my knowledge, cross validation just evaluates the model and shows whether or not you overfit/underfit your data (it does not actually train the model). So, in my view, the more data you have, the better the performance will be, and I would use the whole dataset.

What do you think?

Model performance depends on the way the data is split, and sometimes a model does not generalize well.

That is why we need cross validation.

Cross-validation is a vital step in evaluating a model. It makes the most of the available data: over the course of the folds, the model is not only trained but also tested on all of the available data.

I am wondering whether [X,y] is X_train and y_train or [X,y] should be the whole dataset.

[X, y] should be the whole dataset, because cross validation internally splits the data into training data and test data.

Suppose you use cross validation with 5 folds (cv = 5).
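To make the internal splitting concrete, here is a small sketch that prints the index sets a 5-fold split produces. It uses KFold directly on a toy array of 10 samples (both are assumptions for illustration). For an integer cv, cross_val_score uses KFold for regressors and StratifiedKFold for classifiers with binary or multiclass targets, so the exact indices can differ in the classification case.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each (illustrative assumption).
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks

# Each split yields the row indices used for training and for testing.
folds = list(kf.split(X))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample lands in the test set of exactly one fold and in the training set of the other four.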

We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest.

Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest. We repeat this for each of the remaining folds.
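The procedure just described can be sketched as an explicit loop and compared against cross_val_score. This is a rough equivalent under stated assumptions, not a copy of scikit-learn's internals: the synthetic data, LinearRegression, and the use of r2_score as the metric of interest are all illustrative choices.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative assumption).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
kf = KFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # fit on the other four folds
    y_pred = fold_model.predict(X[test_idx])     # predict on the held-out fold
    manual_scores.append(r2_score(y[test_idx], y_pred))

# Passing the same splitter makes both approaches use identical folds.
library_scores = cross_val_score(model, X, y, cv=kf)

print(np.allclose(manual_scores, library_scores))
```

Passing the KFold object as cv, instead of the integer 5, is what guarantees the manual loop and the library call see the same splits.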


By default (scoring=None), scikit-learn's cross_val_score() function uses the estimator's own score method, which for regressors is the R^2 score.

The R^2 score is also called the coefficient of determination.
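A quick way to check the default metric is to compare the default scores with an explicit scoring="r2" request. The sketch below assumes a Ridge regressor and synthetic data purely for illustration, and also shows how a different metric could be requested through the scoring argument.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative assumption).
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=1)

reg = Ridge(alpha=1.0)

# scoring=None falls back to reg.score, which is R^2 for regressors.
default_scores = cross_val_score(reg, X, y, cv=5)
r2_scores = cross_val_score(reg, X, y, cv=5, scoring="r2")

# Error metrics are exposed negated so that higher is always better;
# flipping the sign recovers the mean squared error per fold.
mse_scores = -cross_val_score(reg, X, y, cv=5, scoring="neg_mean_squared_error")

print(np.allclose(default_scores, r2_scores))
print(mse_scores)
```

For a classifier, the same fallback would give accuracy instead, since that is what a classifier's score method returns.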


