
Python: In which cases can random forest and SVM classifiers produce high accuracy?

Original post: 2015-04-26 13:15:26 | Tags: python / classification / svm / random-forest

I am using Random Forest and SVM classifiers for a classification task. I have 18322 samples spread unevenly across 9 classes (3667, 1060, 1267, 2103, 2174, 1495, 884, 1462, 4210). I use 10-fold cross-validation, and my training data has 100 feature dimensions. In my samples, the training data do not differ much across these 100 dimensions. With SVM the accuracy is approximately 40%, whereas with RF it reaches 92%. When I make the data even less different across these 100 feature dimensions, RF still gives an accuracy of 92%, but the accuracy of SVM drops to 25%.

My classifier configurations are:

SVM: LinearSVC(penalty="l1", dual=False)

RF: RandomForestClassifier(n_estimators=50)

All other parameters are left at their default values. I think there must be something wrong with my RF classifier, but I don't know how to check it.
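One way to sanity-check both classifiers is to score them against a trivial baseline under the same stratified 10-fold split. With the class counts quoted above, always predicting the largest class would already score about 23% (4210 / 18322), so an accuracy near that level means the model has learned very little. The sketch below is only an illustration: the data come from make_classification as a hypothetical stand-in for the real samples, and the helper names make_toy_data and compare_classifiers are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Class counts quoted in the question; used only to mimic the imbalance.
CLASS_COUNTS = [3667, 1060, 1267, 2103, 2174, 1495, 884, 1462, 4210]


def make_toy_data(n_samples=1800, random_state=0):
    """Synthetic stand-in for the asker's data (assumption, not the real set)."""
    total = sum(CLASS_COUNTS)
    weights = [count / total for count in CLASS_COUNTS]
    return make_classification(
        n_samples=n_samples,
        n_features=100,
        n_informative=20,
        n_classes=9,
        n_clusters_per_class=1,
        weights=weights,
        random_state=random_state,
    )


def compare_classifiers(X, y, random_state=0):
    """Return the mean 10-fold accuracy of a baseline, LinearSVC and RF."""
    models = {
        # Always predicts the most frequent class: the score to beat.
        "baseline": DummyClassifier(strategy="most_frequent"),
        "linear_svc": LinearSVC(penalty="l1", dual=False),
        "random_forest": RandomForestClassifier(
            n_estimators=50, random_state=random_state
        ),
    }
    # Stratified folds keep the class proportions similar in every split.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)
    return {
        name: float(np.mean(cross_val_score(model, X, y, cv=cv)))
        for name, model in models.items()
    }


if __name__ == "__main__":
    X, y = make_toy_data()
    for name, score in compare_classifiers(X, y).items():
        print(f"{name}: {score:.3f}")
```

If the random forest score stays high on the real data under shuffled, stratified folds, the next thing worth checking is whether any of the 100 features leaks the label or whether duplicated samples end up in both training and test folds.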

Can anyone familiar with these two classifiers give me some hints?

1 Answer

LinearSVC tries to separate your classes by finding appropriate hyperplanes in Euclidean space. Your samples might simply not be linearly separable, which would cause the poor performance. Random Forest, on the other hand, combines several (in this case 50) simpler classifiers (decision trees), each of which has a piecewise linear, axis-aligned decision boundary. When you combine them, you end up with a much more complicated decision function.
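The effect of linear separability can be seen on a toy problem. The sketch below uses make_circles (two concentric rings, which no single hyperplane can separate) as an assumed example dataset; it is not the asker's data, and the helper name linear_vs_forest is hypothetical. It fits the two configurations from the question and returns their held-out accuracies.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def linear_vs_forest(random_state=0):
    """Return (LinearSVC accuracy, RandomForest accuracy) on concentric rings."""
    # Two rings around the origin: the classes are not linearly separable.
    X, y = make_circles(
        n_samples=600, noise=0.05, factor=0.5, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state
    )

    linear = LinearSVC(penalty="l1", dual=False).fit(X_train, y_train)
    forest = RandomForestClassifier(
        n_estimators=50, random_state=random_state
    ).fit(X_train, y_train)

    return linear.score(X_test, y_test), forest.score(X_test, y_test)


if __name__ == "__main__":
    linear_acc, forest_acc = linear_vs_forest()
    print(f"LinearSVC: {linear_acc:.3f}, RandomForest: {forest_acc:.3f}")
```

On data like this the linear model is expected to stay close to chance level while the forest separates the rings, which mirrors the gap described in the question.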

In my experience, RF tends to perform quite well with default parameters, and even an extensive parameter search improves accuracy only a little. SVM behaves almost exactly the opposite way.

Have you tried different configurations? How about running a grid search to find better parameters for the SVM?

Since you're already using sklearn, you can use sklearn.grid_search.GridSearchCV (in current scikit-learn releases the class lives in sklearn.model_selection); the scikit-learn documentation has more details.
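A grid search over the regularization strength C of the LinearSVC from the question might look like the sketch below. It assumes a recent scikit-learn, where GridSearchCV is imported from sklearn.model_selection. The grid values, the max_iter setting, the synthetic data, and the helper name tune_linear_svc are all illustrative choices, not recommendations from the answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Candidate values for the regularization strength (illustrative only).
C_GRID = [0.01, 0.1, 1.0, 10.0]


def tune_linear_svc(X, y, cv=5):
    """Cross-validate LinearSVC for every C in C_GRID and return the search."""
    search = GridSearchCV(
        # max_iter is raised above the default to reduce convergence warnings.
        LinearSVC(penalty="l1", dual=False, max_iter=5000),
        param_grid={"C": C_GRID},
        scoring="accuracy",
        cv=cv,
    )
    search.fit(X, y)
    return search


if __name__ == "__main__":
    # Synthetic stand-in data (assumption); replace with the real features.
    X, y = make_classification(
        n_samples=600, n_features=20, n_informative=10,
        n_classes=3, random_state=0,
    )
    search = tune_linear_svc(X, y)
    print(search.best_params_, round(search.best_score_, 3))
```

If the tuned linear model still lags far behind the random forest, that supports the explanation that the classes are not linearly separable, and a kernel SVM such as sklearn.svm.SVC could be added to the search.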