
How to improve the accuracy of a decision tree in MATLAB

I have a set of data which I classify in MATLAB using a decision tree. I divide the set into two parts: training data (85%) and test data (15%). The problem is that the accuracy is around 90% and I do not know how to improve it. I would appreciate any ideas.
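For reference, the setup described above looks roughly like this. The question is about MATLAB (`fitctree` would be the direct equivalent there); this is a hedged sketch of the same split-and-train baseline in Python with scikit-learn, using a built-in dataset as a stand-in for the asker's data:

```python
# Baseline: 85%/15% train/test split and a single decision tree.
# (scikit-learn stand-in for the MATLAB workflow in the question;
# load_breast_cancer is just a placeholder dataset.)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stratify so the class ratios are preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
```

A single held-out split like this gives a fairly noisy accuracy estimate, which is relevant to several of the answers below.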

Decision trees may perform poorly for many reasons; one prominent reason is that, when calculating a split, they do not consider the inter-dependency of variables, or the dependency of the target variable on interactions between other variables. Before trying to improve performance, keep in mind that the changes should not cause over-fitting and the model should still generalize.

To improve performance these few things can be done:

  • Variable preselection: Run tests such as a multicollinearity check, VIF (variance inflation factor) calculation, or IV (information value) calculation to keep only a few top variables. This leads to improved performance, as it strictly cuts out the undesired variables.

  • Ensemble learning: Use multiple trees (random forests) to predict the outcome. Random forests generally perform better than a single decision tree because averaging many deep trees reduces variance without substantially increasing bias. They are also less prone to overfitting.

  • K-fold cross-validation: Cross-validation on the training data itself (for example, to choose tree depth or other hyperparameters) can improve the performance of the model a bit.

  • Hybrid model: Use a hybrid model, i.e., feed the output of the decision trees into a logistic regression, to improve performance.

I guess the more important question here is what counts as good accuracy in the given domain: if you're classifying spam then 90% might be a bit low, but if you're predicting stock prices then 90% is really high!

If you're working on a well-known domain and there are previous examples of classification accuracy higher than yours, then there are several things you can try.

I don't think you should push this higher; the data may be overfitted by the classifier. Try another data set, or use cross-validation, to get a more reliable estimate.

By the way, 90%, if it is not the result of overfitting, is a great result; you may not even need to improve it.

You could look into pruning the leaves to improve the generalization of the decision tree. But as was mentioned, 90% accuracy can be considered quite good.
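Pruning as suggested above can be done in MATLAB with the `prune` function on a `fitctree` model; as an illustration, here is a sketch using scikit-learn's minimal cost-complexity pruning, which serves the same purpose:

```python
# Post-prune a decision tree by sweeping the cost-complexity path and
# keeping the pruning strength (ccp_alpha) that generalizes best.
# (Chosen on the test set here only for brevity; use a separate
# validation split in practice.)
from sklearn.datasets import load_breast_cancer  # placeholder dataset
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The path suggests the alphas at which subtrees collapse; the last
# alpha prunes everything down to a single node, so skip it.
path = unpruned.cost_complexity_pruning_path(X_train, y_train)
candidates = [
    # max(a, 0.0) guards against tiny negative alphas from rounding
    DecisionTreeClassifier(random_state=0, ccp_alpha=max(a, 0.0))
    .fit(X_train, y_train)
    for a in path.ccp_alphas[:-1]
]
best = max(candidates, key=lambda m: m.score(X_test, y_test))

print(f"unpruned leaves: {unpruned.get_n_leaves()}, "
      f"pruned leaves: {best.get_n_leaves()}")
```

A smaller tree with equal or better held-out accuracy is exactly the "improved generalization" this answer is after.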

Whether 90% is good or bad depends on the domain of the data.

However, it might be that the classes in your data overlap, in which case you can't really do better than 90%.

You can also look at which nodes the errors fall into, and check whether the classification can be improved by changing them.
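A concrete way to do that inspection (in MATLAB, `confusionmat` plus the tree's leaf assignments would play the same role) is sketched here in scikit-learn: the confusion matrix shows which classes are confused, and `tree.apply` shows which leaves the misclassified samples land in.

```python
# Inspect where the errors are: per-class confusion matrix, plus the
# leaf indices that the misclassified test samples fall into.
from sklearn.datasets import load_breast_cancer  # placeholder dataset
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows = true class, columns = predicted class

# apply() returns, for each sample, the index of the leaf it reaches;
# the leaves that collect errors are candidates for further splitting
# or for different leaf labels.
wrong = y_pred != y_test
print("leaves containing errors:", sorted(set(tree.apply(X_test[wrong]))))
```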

You can also try Random Forest.
