
How does cross-validation work in a learning curve? Python sklearn

Say I have a learning curve like the sklearn learning curve for an SVM. I'm also doing 5-fold cross-validation, which, as far as I understand, means splitting the training data into 5 pieces, training on four of them and testing on the last one.

So my question is: since for each data point on the learning curve the size of the training set is different (because we want to see how the model performs with an increasing amount of data), how does cross-validation work in that case? Does it still split the whole training set into 5 equal pieces? Or does it split the training set of the current point into five smaller pieces and then compute the test score? Also, is it possible to get a confusion matrix (true positives, true negatives, etc.) for each data point? I don't see a way to do that based on the sklearn learning-curve code.
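(For what it's worth, `learning_curve` does not return per-fold predictions, so there is no built-in per-point confusion matrix. A rough sketch of a manual workaround, assuming a toy dataset from `make_classification` and a plain `SVC` as stand-ins for the real problem, could look like this:)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Toy data as a stand-in for the real problem.
X, y = make_classification(n_samples=500, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
train_sizes = np.linspace(0.1, 1.0, 5)

for train_idx, test_idx in cv.split(X):
    for frac in train_sizes:
        n = int(frac * len(train_idx))  # size of this learning-curve point
        subset = train_idx[:n]          # first n indices of the training folds
        model = SVC().fit(X[subset], y[subset])
        cm = confusion_matrix(y[test_idx], model.predict(X[test_idx]))
        print(f"n_train={n}, confusion matrix:\n{cm}")
```

This mirrors what `learning_curve` does internally (growing subsets of the training folds, fixed test fold) while keeping the predictions, so you can compute any per-point metric you like.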

Does the number of cross-validation folds relate to the number of training-set sizes we request with train_sizes = np.linspace(0.1, 1.0, 5)?

train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator, X, y, cv=cv, n_jobs=n_jobs, scoring=scoring,
    train_sizes=train_sizes, return_times=True)
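(A minimal self-contained version of this call might look like the following; the dataset and scorer here are placeholders, not from the original question:)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(), X, y,
    cv=5,                                   # 5-fold cross-validation
    scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5),   # 5 learning-curve points
    n_jobs=-1,
)

# One row per train size, one column per CV fold.
print(train_scores.shape)   # (5, 5)
print(test_scores.shape)    # (5, 5)
```

The score arrays have one row per entry in `train_sizes` and one column per fold, which already hints at the answer below: the number of folds and the number of train sizes are two independent axes.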

Thank you!

No, it does not re-split at each point. It still splits the whole training data into 5 folds. Then, for a particular combination of training folds (for example, folds 1, 2, 3 and 4 as training), it picks only the first k data points (the x-tick values) as the training set from those 4 training folds. The test fold is used unchanged as the testing data.

If you look at the learning_curve source code here, it becomes clearer:

for train, test in cv_iter:
    for n_train_samples in train_sizes_abs:
        # Take the first n_train_samples indices of the training folds;
        # the test fold is reused as-is for every train size.
        train_test_proportions.append((train[:n_train_samples], test))

For the plot you mentioned, n_train_samples would be something like [200, 400, ..., 1400].
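To see the slicing concretely, here is a small sketch (the 100-sample array and the absolute sizes [16, 40, 80] are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)
cv = KFold(n_splits=5)
train_sizes_abs = [16, 40, 80]  # hypothetical absolute train sizes

# Look at just the first CV split: 80 training indices, 20 test indices.
train, test = next(iter(cv.split(X)))
for n in train_sizes_abs:
    subset = train[:n]  # same slicing as in the sklearn source above
    print(len(subset), len(test))  # training subset grows; test fold stays fixed
```

Each learning-curve point reuses the same 20-sample test fold and only grows the training subset, which is exactly why the test-fold size never appears on the x-axis.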

Does the number of cross-validation folds relate to the number of training-set sizes we request with train_sizes = np.linspace(0.1, 1.0, 5)?

No. The number of folds is independent of train_sizes; you can't assign a number of folds per train size. Each train size is just a subset of data points taken from all the training folds.
