
How to improve accuracy of decision tree in matlab

I have a set of data which I classify in MATLAB using a decision tree. I divide the set into two parts: training data (85%) and test data (15%). The problem is that the accuracy is around 90% and I do not know how I can improve it. I would appreciate any ideas about it.
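For reference, the hold-out setup described in the question can be sketched as follows. The question is about MATLAB, but the same workflow is shown here in Python with scikit-learn for illustration; the built-in `load_iris` dataset is a stand-in for the asker's data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 15% of the data for testing, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on the held-out 15%
```

Note that a single 85/15 split gives a noisy accuracy estimate; the cross-validation suggestions below address exactly that.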

Decision trees may perform poorly for many reasons; one prominent reason I can think of is that, when calculating a split, they do not consider the inter-dependency of variables, or of the target variable on other variables. Before trying to improve performance, one should make sure the changes do not cause over-fitting and that the model can still generalize.

To improve performance, a few things can be done:

  • Variable preselection: Different tests can be run on the variables, such as a multicollinearity test, VIF (variance inflation factor) calculation, or IV (information value) calculation, to select only the few most informative ones. This can improve performance by strictly cutting out the undesired variables.
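As one concrete example of such a test, the VIF mentioned above can be computed directly: for each feature, regress it on the remaining features and convert the resulting R² into an inflation factor. A minimal sketch in Python/NumPy (the two nearly collinear synthetic columns are invented here purely to demonstrate the flag):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features).
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the
    remaining columns. Values above roughly 5-10 flag multicollinearity."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the other columns (plus intercept).
        A = np.column_stack([others, np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Columns 0 and 1 are nearly identical, so both get large VIFs;
# column 2 is independent noise and stays near 1.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
print(vif(X))
```

Features whose VIF is large carry little information beyond what the other features already provide, so dropping them rarely hurts the tree.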

  • Ensemble learning: Use multiple trees (random forests) to predict the outcome. Random forests generally perform better than a single decision tree because they manage to reduce both bias and variance, and they are less prone to over-fitting as well.
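The single-tree vs. forest comparison is easy to reproduce. A sketch in Python/scikit-learn (the `load_breast_cancer` dataset is an arbitrary stand-in; in MATLAB the analogous tool is `TreeBagger`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Average 5-fold accuracy: a single tree vs. a bagged ensemble of 200 trees.
tree_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5).mean()
print(tree_acc, forest_acc)
```

On most tabular datasets the forest's averaged predictions come out a few points ahead of the single tree, at the cost of interpretability.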

  • K-fold cross-validation: Cross-validation on the training data itself can improve the performance of the model a bit.
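In practice cross-validation helps most when it is used to choose the tree's size, rather than trusting a single 85/15 split. A sketch in Python/scikit-learn (the parameter grid below is an illustrative choice, not a recommendation from the answer):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 10-fold cross-validation over a small grid of tree-size parameters;
# best_score_ is the mean held-out accuracy of the best combination.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, None],
                "min_samples_leaf": [1, 5, 10]},
    cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The cross-validated score is also a far more stable estimate of real accuracy than the single 15% hold-out in the question.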

  • Hybrid model: Use a hybrid model, i.e. apply logistic regression after the decision tree to improve performance.
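One common reading of this tree-then-logistic-regression hybrid is to use the tree's leaves as features: each sample is encoded by the leaf it falls into, and a logistic regression is fitted on those indicators. A sketch of that interpretation in Python/scikit-learn (the depth and dataset are arbitrary choices for the demo):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: fit a tree and map every sample to the leaf it lands in.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
enc = OneHotEncoder(handle_unknown="ignore")
leaves_train = enc.fit_transform(tree.apply(X_train).reshape(-1, 1))
leaves_test = enc.transform(tree.apply(X_test).reshape(-1, 1))

# Step 2: fit logistic regression on the one-hot leaf indicators.
logit = LogisticRegression(max_iter=1000).fit(leaves_train, y_train)
print(logit.score(leaves_test, y_test))
```

The tree supplies non-linear feature interactions, and the logistic layer re-weights its leaves with calibrated probabilities.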

I guess the more important question here is what counts as good accuracy for the given domain: if you're classifying spam then 90% might be a bit low, but if you're predicting stock prices then 90% is really high!

If you're doing this on a known domain and there are previous examples of classification accuracy higher than yours, then you can try several things:

I don't think you should try to improve this; it may be that the classifier is over-fitting the data. Try other data sets, or cross-validation, to get a more accurate estimate.

By the way, 90%, if not over-fitted, is a great result; you may not even need to improve it.

You could look into pruning the leaves to improve the generalization of the decision tree. But as was mentioned, 90% accuracy can be considered quite good.
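Pruning as suggested above can be sketched with cost-complexity pruning, where a larger penalty `ccp_alpha` collapses more of the tree. An illustrative Python/scikit-learn version (the mid-path alpha is an arbitrary demo choice; in practice it would be selected by cross-validation, and in MATLAB the analogous function is `prune`):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree, then the sequence of prunings it admits.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit with a mid-path penalty: larger ccp_alpha prunes more nodes.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print(full.tree_.node_count, pruned.tree_.node_count)
```

The pruned tree trades a little training accuracy for a simpler structure that usually generalizes better to the test set.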

Whether 90% is good or bad depends on the domain of the data.

However, it might be that the classes in your data overlap, and you can't really do better than 90%.

You can try to look at which nodes the errors occur in, and check whether the classification can be improved by changing them.
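To inspect the nodes by hand as suggested, the fitted tree's split rules can be printed as text. A small Python/scikit-learn sketch on the iris dataset (the feature names are just readable labels for the demo):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the split thresholds and leaf classes, so the nodes where
# misclassifications concentrate can be examined directly.
rules = export_text(
    tree, feature_names=["sepal_l", "sepal_w", "petal_l", "petal_w"])
print(rules)
```

Reading the printed thresholds often reveals splits on noisy features or tiny leaves that only memorize a few training points.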

You can also try Random Forest.

Notice: The technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

粤ICP备18138465号  © 2020-2024 STACKOOM.COM