
How does cross-validation work in a learning curve? Python sklearn

Say I have a learning curve like the sklearn learning curve for an SVM. I'm also doing 5-fold cross-validation, which, as far as I understand, means splitting the training data into 5 pieces, training on four of them and testing on the last one.

So my question is: since for each data point on the learning curve the size of the training set is different (because we want to see how the model performs with an increasing amount of data), how does cross-validation work in that case? Does it still split the whole training set into 5 equal pieces? Or does it split the training set of the current point into five smaller pieces and then compute the test score? Also, is it possible to get a confusion matrix (true positives, true negatives, etc.) for each data point? I don't see a way to do that based on the sklearn learning-curve code.
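(For what it's worth, `learning_curve` does not return per-fold predictions, so there is no built-in per-point confusion matrix. A rough sketch of a manual workaround, assuming a toy dataset from `make_classification` and a plain `SVC` as stand-ins for the real problem, could look like this:)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Toy data as a stand-in for the real problem.
X, y = make_classification(n_samples=500, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
train_sizes = np.linspace(0.1, 1.0, 5)

for train_idx, test_idx in cv.split(X):
    for frac in train_sizes:
        n = int(frac * len(train_idx))  # size of this learning-curve point
        subset = train_idx[:n]          # first n indices of the training folds
        model = SVC().fit(X[subset], y[subset])
        cm = confusion_matrix(y[test_idx], model.predict(X[test_idx]))
        print(f"n_train={n}, confusion matrix:\n{cm}")
```

This mirrors what `learning_curve` does internally (growing subsets of the training folds, fixed test fold) while keeping the predictions, so you can compute any per-point metric you like.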

Does the number of cross-validation folds relate to the number of training-set sizes we request with train_sizes = np.linspace(0.1, 1.0, 5)?

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator, X, y, cv=cv, n_jobs=n_jobs, scoring=scoring,
    train_sizes=train_sizes, return_times=True)
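(A minimal self-contained version of this call might look like the following; the dataset and scorer here are placeholders, not from the original question:)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(), X, y,
    cv=5,                                   # 5-fold cross-validation
    scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5),   # 5 learning-curve points
    n_jobs=-1,
)

# One row per train size, one column per CV fold.
print(train_scores.shape)   # (5, 5)
print(test_scores.shape)    # (5, 5)
```

The score arrays have one row per entry in `train_sizes` and one column per fold, which already hints at the answer below: the number of folds and the number of train sizes are two independent axes.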

Thank you!

No, it does not re-split at each point. It still splits the whole training data into 5 folds. Then, for a particular combination of training folds (for example, folds 1, 2, 3 and 4 as training), it picks only the first k data points (the x-tick values) as the training set from those 4 training folds. The test fold is used unchanged as the testing data.

If you look at the learning_curve source code here, it becomes clearer:

for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        # Take the first n_train_samples indices of the training folds;
        # the test fold is reused as-is for every train size.
        train_test_proportions.append((train[:n_train_samples], test))

For the plot you mentioned, n_train_samples would be something like [200, 400, ..., 1400].
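To see the slicing concretely, here is a small sketch (the 100-sample array and the absolute sizes [16, 40, 80] are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)
cv = KFold(n_splits=5)
train_sizes_abs = [16, 40, 80]  # hypothetical absolute train sizes

# Look at just the first CV split: 80 training indices, 20 test indices.
train, test = next(iter(cv.split(X)))
for n in train_sizes_abs:
    subset = train[:n]  # same slicing as in the sklearn source above
    print(len(subset), len(test))  # training subset grows; test fold stays fixed
```

Each learning-curve point reuses the same 20-sample test fold and only grows the training subset, which is exactly why the test-fold size never appears on the x-axis.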

Does the number of cross-validation folds relate to the number of training-set sizes we request with train_sizes = np.linspace(0.1, 1.0, 5)?

No. The number of folds is independent of train_sizes; you can't assign a number of folds per train size. Each train size is just a subset of data points taken from all the training folds.
