
Learning curves with train/test data in scikit-learn instead of cross-validation

Original post: 2015-09-19 18:48:24. Tags: python, machine-learning, scipy, scikit-learn

I have my training and testing data in separate files (different CSVs loaded into different pandas DataFrames), and I want to plot the learning curve with this training and testing data rather than with train/test splits generated from the training set itself by cross-validation (which seems to be the usual way learning_curve works).

It seems that scikit-learn expects the training and testing data to be in the same DataFrame, but then the classifier would learn from the test data as well, which is not what I want.

How can I go about solving this problem? I am new to scikit-learn.

2 Answers

You will need to keep your training and test data separate (at least in separate variables within the code). The learning curve can then be computed on the training set alone. This way you can tune your experiment without touching the test set, which avoids overfitting to it.
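As a minimal sketch of that idea: the snippet below calls learning_curve on the training portion only, so the held-out rows are never seen. The synthetic data from make_classification, the LogisticRegression estimator, and the 400/200 split are assumptions made for illustration, standing in for the two CSV-backed DataFrames from the question. It also assumes a recent scikit-learn release, where learning_curve lives in sklearn.model_selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the two separately loaded datasets (assumption).
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train = X_all[:400], y_all[:400]
X_test, y_test = X_all[400:], y_all[400:]  # never passed to learning_curve

# Cross-validation happens inside the training data only.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of each CV training fold
    cv=5,
)

# One row per training size, one column per fold; average over folds to plot.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
```

The arrays train_sizes, mean_train and mean_val can then be handed to any plotting library; the test set stays untouched until the final evaluation.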

To check how well the model does on held-out data, scikit-learn offers the validation curve (validation_curve), which scores the estimator on the validation folds over a range of values for a single hyperparameter.
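A short sketch of a validation curve, under the same illustrative assumptions (synthetic data, a LogisticRegression estimator, and its regularization parameter C as the hyperparameter being varied). Note that validation_curve, as used here, splits the data it is given with cross-validation; it does not take a separate test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the training DataFrame (assumption).
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=0)

# Candidate values for the hyperparameter being studied.
param_range = np.logspace(-3, 2, 6)

train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    param_name="C",
    param_range=param_range,
    cv=5,
)

# Pick the value with the best mean score across the validation folds.
best_C = param_range[val_scores.mean(axis=1).argmax()]
```

Each row of train_scores and val_scores corresponds to one value of C and each column to one fold.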

Scikit-learn's learning_curve is a little trickier. It lets you define the train_sizes of the training subsets and then runs cross-validation for each of them (parameter cv, which defaulted to 3-fold cross-validation at the time of writing; releases from 0.22 onward default to 5-fold).
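One possible way to get a learning curve scored against a fixed, separate test set is to exploit the fact that cv also accepts an iterable of (train indices, test indices) pairs. The sketch below stacks the two datasets and passes a single predefined split, so the estimator is only ever fitted on training rows and only ever scored on test rows. The synthetic data, the estimator, and the chosen sizes are assumptions for illustration, not part of the original answers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the separately loaded train and test data (assumption).
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train = X_all[:400], y_all[:400]
X_test, y_test = X_all[400:], y_all[400:]

# Stack both sets, remembering which rows belong to which.
X = np.vstack([X_train, X_test])
y = np.concatenate([y_train, y_test])
train_idx = np.arange(len(X_train))
test_idx = np.arange(len(X_train), len(X))

# A single fixed split: fit on growing prefixes of train_idx, score on test_idx.
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[80, 160, 240, 320, 400],  # absolute numbers of training rows
    cv=[(train_idx, test_idx)],
)

# Only one split, so each score array has a single column.
train_curve = train_scores[:, 0]
test_curve = test_scores[:, 0]
```

With the default shuffle=False, each training subset is simply the first n rows of the training data, so shuffle the training rows beforehand if the file is ordered by class or by time.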