
Issue Validating RandomizedSearchCV Results

Original post: 2019-07-14 04:15:45. Tags: scikit-learn, logistic-regression, best-fit

I start with a basic logistic regression, using all default hyperparameters, and I get a score of 0.8855.

Next I run a random search (RandomizedSearchCV) to find the best hyperparameters. According to the search, C=10 with max_iter=110 gives a score of 0.89.

I run the logistic regression with these hyperparameters but get a much better accuracy, 0.91!

Why am I not getting exactly the same number?

1 Answer

You should not expect to get exactly the same accuracy when you run the model again on your training set. When you do k-fold cross-validation to check the performance of a particular set of hyperparameters, the data is divided into k sets; k-1 sets are used for training and the model is validated on the one set left over. This process is repeated k times, each time holding out a different set for validation. Finally, the scores of all k iterations are averaged, and that average is the accuracy reported in random_result.best_score_ (the figure in the original answer illustrates this process).
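The averaging described above can be checked directly. The following is a minimal sketch, not the asker's actual setup: the synthetic dataset, the parameter candidates, `n_iter=5` and `cv=3` are all assumptions chosen for illustration. It reads the per-fold validation scores of the winning candidate from `cv_results_` and compares their mean with `best_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for the asker's dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hypothetical candidate values; the original search space is not shown.
param_distributions = {
    "C": [0.01, 0.1, 1, 10, 100],
    "max_iter": [100, 110, 120, 130],
}

n_folds = 3
search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions,
    n_iter=5,
    cv=n_folds,
    scoring="accuracy",
    random_state=0,
)
random_result = search.fit(X, y)  # fit() returns the fitted search object

# Validation score of the winning candidate on each of the k folds.
best = random_result.best_index_
fold_scores = np.array(
    [random_result.cv_results_[f"split{i}_test_score"][best] for i in range(n_folds)]
)

print("fold scores:  ", fold_scores)
print("mean of folds:", fold_scores.mean())
print("best_score_:  ", random_result.best_score_)
```

Each fold score comes from a model trained on only k-1 of the k sets, so `best_score_` is an average over k different models, none of which is the final model.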

After finding the best set of hyperparameters, you fit the model on the entire training data, i.e. set 1, set 2 and set 3 together. Some variation is therefore expected: the training data has changed, and you are now evaluating on the entire training data. Scoring a model on the same data it was fit on also tends to give a higher number than the cross-validated average. So what you observe is completely normal, expected behavior.
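To make the comparison concrete, here is a hedged sketch that puts the three numbers side by side. The synthetic data, the search space and the 75/25 split are assumptions for illustration, and the held-out test set is an addition that the original question does not mention. It refits the winning hyperparameters on the whole training set, as in the question, and scores that model on both the training data and the unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data and a held-out split (both assumptions for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(),
    {"C": [0.01, 0.1, 1, 10, 100], "max_iter": [100, 110, 120, 130]},
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# Refit the winning hyperparameters on the whole training set.
final_model = LogisticRegression(**search.best_params_).fit(X_train, y_train)

cv_score = search.best_score_                       # mean over validation folds
train_score = final_model.score(X_train, y_train)   # scored on data it was fit on
test_score = final_model.score(X_test, y_test)      # scored on unseen data

# With the default refit=True, the search already holds an equivalent model.
refit_score = search.best_estimator_.score(X_train, y_train)

print(f"cross-validated score: {cv_score:.4f}")
print(f"training-set score:    {train_score:.4f}")
print(f"held-out test score:   {test_score:.4f}")
```

The three scores answer different questions, so there is no reason for them to match. If a single number is needed to compare against `best_score_`, the held-out score is usually the fairer choice, because the training-set score is measured on data the model has already seen.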

Hope this helps!