
Ensemble classifier in MATLAB

I want to use ensemble classifiers for binary classification of 300 samples (15 positive and 285 negative). I extracted 18 features from these samples; all of them are numerical, and some of the features are correlated. I am new to MATLAB, and I tried using "fitensemble", but I don't know which method to use: 'AdaBoostM1', 'LogitBoost', 'GentleBoost', 'RobustBoost', 'Bag' or 'Subspace'. Since there are only 18 features, I don't know whether boosting algorithms can help me or not. I also have trouble choosing the number of learners. How many learners are suitable for this problem so that I can get the best classification? I would appreciate your help.

Off the top of my head, I would say that an ensemble classifier is overkill given that you only have 15 positive samples and only 18 features. For a data set this small, I would start with a k-nearest-neighbor classifier. If that doesn't work well, try a Support Vector Machine.
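
As a minimal baseline sketch of that advice (assuming the 300x18 feature matrix is in X and the labels in y; the variable names and the parameter values are assumptions, not tuned settings):

% Simple kNN baseline with standardized features
mdlKNN = fitcknn(X, y, 'NumNeighbors', 5, 'Standardize', true);
cvKNN  = crossval(mdlKNN, 'KFold', 5);
fprintf('kNN 5-fold CV error: %.3f\n', kfoldLoss(cvKNN));

% If kNN underperforms, try an SVM with a Gaussian kernel
mdlSVM = fitcsvm(X, y, 'KernelFunction', 'rbf', 'Standardize', true);
cvSVM  = crossval(mdlSVM, 'KFold', 5);
fprintf('SVM 5-fold CV error: %.3f\n', kfoldLoss(cvSVM));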

I have used ensemble techniques on many problems. This data set has a class-imbalance problem, and when you combine class imbalance with very few samples, machine learning gets tricky. SVMs are very sensitive to class imbalance: if you use a single SVM, you need to change the misclassification cost for the two classes.

If you rely only on MATLAB's built-in functions, you will not have full control over this. Note that your problem is also multi-feature.
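
A sketch of the cost-sensitive single-SVM option (assuming positives are labeled 1 and negatives 0; the 19:1 cost ratio simply mirrors the 285/15 class ratio and is an assumption, not a tuned value):

mdlCost = fitcsvm(X, y, ...
    'KernelFunction', 'rbf', ...
    'Standardize', true, ...
    'ClassNames', [0; 1], ...                % negative class listed first
    'Cost', [0 1; 19 0]);                    % Cost(i,j) = cost of predicting
                                             % class j when the truth is i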

Start with a bagging technique: the base learners can be SVMs, trained with down-sampling of the majority class. Use all 15 samples from the minority class and 15 samples from the majority class; the majority-class samples should be randomly drawn each time you train a base learner.

The number of base learners can be: the number of samples in the majority class divided by the number of samples in the minority class (here, 285 / 15 = 19).

For testing: classify the sample with all base learners and average their outputs.
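
A minimal sketch of this scheme (assuming the same X and y, labels 1/0, and a held-out test set Xtest; all variable names are hypothetical):

posIdx = find(y == 1);                       % 15 minority (positive) samples
negIdx = find(y == 0);                       % 285 majority (negative) samples
nLearners = floor(numel(negIdx) / numel(posIdx));   % 285/15 = 19 base learners

models = cell(nLearners, 1);
for k = 1:nLearners
    % Fresh random subsample of the majority class for each base learner
    sub = negIdx(randperm(numel(negIdx), numel(posIdx)));
    idx = [posIdx; sub(:)];
    models{k} = fitcsvm(X(idx,:), y(idx), ...
        'KernelFunction', 'rbf', 'Standardize', true);
end

% Testing: classify with every base learner and average the votes
votes = zeros(size(Xtest, 1), nLearners);
for k = 1:nLearners
    votes(:, k) = predict(models{k}, Xtest);
end
yhat = mean(votes, 2) >= 0.5;                % ensemble majority vote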

If the accuracy is still not high, that means the classification problem is difficult, and it is better to use AdaBoost. There is a good article about using AdaBoost: the Viola-Jones AdaBoost paper.

The Viola-Jones flavor of AdaBoost is very good: it takes care of class imbalance and performs feature selection.
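
In fitcensemble terms, a minimal AdaBoostM1 sketch in that spirit uses decision stumps (an assumed configuration, not the full Viola-Jones cascade; X, y, the ensemble size and learn rate are hypothetical). Since each stump splits on a single feature, boosting on stumps gives a rough form of feature selection:

t = templateTree('MaxNumSplits', 1);         % decision stumps
mdlBoost = fitcensemble(X, y, ...
    'Method', 'AdaBoostM1', ...
    'Learners', t, ...
    'NumLearningCycles', 100, ...            % assumed ensemble size
    'LearnRate', 0.1);                       % assumed shrinkage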

I think you should try to get at least something like 100 observations of each class if possible; it will also make hyperparameter optimization more robust. That would help you find how many learners are suitable and which method is best. From my limited experience, though, it does not make a huge difference which of these ensemble methods you use. If you can't get more data, you could loop over the number of learners (10:5:300) and take the mean of the classification accuracy over 100 repetitions of random under-sampling of the majority class (see the sketch after the example code below).

Here is some example code that you can build a loop around (using fitcensemble in R2016b or higher).

switch classifierParameters.method{1}
case 'Bag'
    t = templateTree(   'MinLeafSize', classifierParameters.minLeafSize, ...
                        'MaxNumSplits', classifierParameters.maxNumSplits, ...
                        'SplitCriterion', classifierParameters.splitCriterion{1}, ...
                        'NumVariablesToSample', classifierParameters.numVariablesToSample);

    classificationEnsemble = fitcensemble(...
        predictors, ...
        response, ...
        'Learners', t, ...
        'Method', classifierParameters.method{1}, ...
        'NumLearningCycles', classifierParameters.numLearningCycles, ...
        'KFold',7); 

case {'AdaBoostM1','GentleBoost','LogitBoost'} 
    t = templateTree(  'MaxNumSplits', classifierParameters.maxNumSplits,...
                        'MinLeafSize', classifierParameters.minLeafSize);
                        % 'SplitCriterion' is always 'mse' for boosting,
                        % so it is not set here

    classificationEnsemble = fitcensemble(...
        predictors, ...
        response, ...
        'Learners', t, ...
        'Method', classifierParameters.method{1}, ...
        'NumLearningCycles',classifierParameters.numLearningCycles,...
        'KFold',7,...
        'LearnRate',classifierParameters.learnRate);

case 'OptimizeHyperparameters'
    strct = struct( 'KFold', 10, 'Verbose',1, 'MaxObjectiveEvaluations',1000, 'SaveIntermediateResults', true, ...
                    'Repartition',false);

    classificationEnsemble = fitcensemble(...
        predictors, ...
        response, ...
        'OptimizeHyperparameters', 'all', ... % or a subset, e.g. {'Method','LearnRate','MinLeafSize','MaxNumSplits','SplitCriterion','NumVariablesToSample'}
        'HyperparameterOptimizationOptions', strct);

otherwise 
    error('Classification method not recognized')
end
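
For the under-sampling sweep described above, a minimal sketch of the outer loop might look like the following (assuming X1 holds the 15 positive rows and X0 the 285 negative rows; the variable names and the 'Bag' settings are assumptions):

cycleGrid = 10:5:300;                        % candidate numbers of learners
meanAcc = zeros(size(cycleGrid));
for c = 1:numel(cycleGrid)
    acc = zeros(100, 1);
    for r = 1:100                            % repeated random under-sampling
        sub = X0(randperm(size(X0,1), size(X1,1)), :);  % 15 random negatives
        X = [X1; sub];
        y = [ones(size(X1,1),1); zeros(size(sub,1),1)];
        cvEns = fitcensemble(X, y, 'Method', 'Bag', ...
            'NumLearningCycles', cycleGrid(c), 'KFold', 7);
        acc(r) = 1 - kfoldLoss(cvEns);       % cross-validated accuracy
    end
    meanAcc(c) = mean(acc);
end
[bestAcc, best] = max(meanAcc);
fprintf('Best: %d learners, mean accuracy %.3f\n', cycleGrid(best), bestAcc);

Note that this is expensive (59 grid points x 100 repetitions x 7 folds), so you may want to shrink the grid or the number of repetitions first.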
