
python - Best techniques to split a dataset to get high performance accuracy

I have applied these 4 methods:

  1. Train and Test Sets.
  2. K-fold Cross Validation.
  3. Leave One Out Cross Validation.
  4. Repeated Random Test-Train Splits.

The "Train and Test Sets" method achieves high accuracy, while the remaining methods all achieve about the same accuracy as one another, but lower than the first approach.

I want to know which method I should choose, given that I have tried all 4.

Train and Test Sets and Cross Validation are each used in certain cases. Cross Validation is used when you want to compare different models. Accuracy generally increases as you use more training data, which is why Leave One Out Cross Validation sometimes performs better than K-fold Cross Validation; it depends on your dataset size, and sometimes on the algorithm you are using.

On the other hand, a plain Train and Test split is usually used when you are not comparing different models and the time required to run cross validation is not worth it; in that case cross validation simply is not needed. In most cases, though, Cross Validation is preferred.

So which method should you choose? That usually depends on the choices you make while training, such as how you handle your data and which algorithm you use. For example, when training with Random Forests it is usually not necessary to do Cross Validation (although you still can if you need it), because the Out of Bag estimate already gives you a built-in estimate of generalization.
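As a minimal sketch of the Out of Bag idea mentioned above, using scikit-learn's RandomForestClassifier (the toy data from make_classification and all hyperparameters here are illustrative assumptions, not part of the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# oob_score=True makes each tree score the samples it did not see during
# bootstrapping, giving a generalization estimate without a separate CV loop.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
clf.fit(X, y)
print("OOB accuracy estimate:", clf.oob_score_)
```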

Training a model involves tuning model accuracy as well as model generalization. If a model does not generalize, it may be an underfit or overfit model.

In that case, the model may perform better on the training data, but its accuracy may decrease on test or unknown data.

We use training data to improve the accuracy of the model. As the training data size increases, model accuracy may also increase.

Similarly, we use different training samples to generalize the model. So train-test splitting methods depend on the size of the available data and on the algorithm used for model design.

The first method, the train-test split, has fixed-size training and testing data. So on each iteration we use the same train data to train the model and the same test data to assess the model's accuracy, as in the sketch below.
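A minimal sketch of such a fixed hold-out split; the 80/20 ratio, the LogisticRegression model, and the toy data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One fixed split: the same train and test rows are reused on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
```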

The second method, k-fold, also has fixed-size train and test data, but on each iteration the test and train data change. So it may be a better approach irrespective of data size.
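A minimal sketch of k-fold cross-validation under the same illustrative assumptions: each of the 5 folds serves once as the test set while the other 4 train the model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```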

The leave-one-out approach is useful only if the data size is small. Here we use almost the whole dataset for training. So the training accuracy of the model will be better, but it may not be a generalized model.
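A minimal sketch of leave-one-out, again on assumed toy data: with n samples there are n folds, each training on n-1 rows, which is why it is practical only for small datasets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Deliberately small dataset: LOO builds one model per sample.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print("LOO accuracy (mean over", len(scores), "folds):", scores.mean())
```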

The randomised train-test method is also a good approach for training and testing a model's performance. Here we randomly select train and test data each time. So it may perform better than the leave-one-out method if the data size is small.
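A minimal sketch of repeated random test-train splits, assuming scikit-learn's ShuffleSplit as the implementation (10 splits and the 80/20 ratio are illustrative choices): each split is drawn independently, so the same row may appear in several test sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 10 independent random 80/20 splits.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Mean accuracy over 10 random splits:", scores.mean())
```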

Finally, each splitting approach has some pros and cons. So it is up to you to decide which splitting method suits your model. It also depends on the data size and on data selection, i.e. how we select data from the sample while splitting.
