
Ensemble classifier in MATLAB

I want to use ensemble classifiers for classification of 300 samples (15 positive samples and 285 negative samples, i.e. binary classification). I extracted 18 features from these samples; all of them are numerical, and there is some correlation between the features. I am new to MATLAB, and I tried using "fitensemble", but I don't know which method to use: 'AdaBoostM1', 'LogitBoost', 'GentleBoost', 'RobustBoost', 'Bag' or 'Subspace'. As the number of features is 18, I don't know whether boosting algorithms can help me or not. On the other hand, I am unsure about the number of learners: how many learners are suitable for this problem so that I can get the optimal classification? I would appreciate your help.

Off the top of my head, I would say that an ensemble classifier is overkill given that you only have 15 positive samples and only 18 features. For a data set this small, I would start with a k-nearest-neighbor classifier. If that doesn't work well, try a support vector machine.
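A minimal sketch of that baseline comparison, assuming X holds the 300x18 feature matrix and Y the class labels (both names, and 'NumNeighbors' of 5, are illustrative choices, not from the question):

% k-NN baseline, evaluated with 7-fold cross-validation
knnModel = fitcknn(X, Y, 'NumNeighbors', 5, 'Standardize', true);
cvKnn = crossval(knnModel, 'KFold', 7);
fprintf('k-NN CV error: %.3f\n', kfoldLoss(cvKnn));

% SVM fallback if k-NN does poorly
svmModel = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'Standardize', true);
cvSvm = crossval(svmModel, 'KFold', 7);
fprintf('SVM CV error: %.3f\n', kfoldLoss(cvSvm));

With 15 positives, plain cross-validated error can be misleading (predicting all-negative already scores 95%), so it is worth also looking at per-class errors via kfoldPredict and a confusion matrix.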

I use ensemble techniques in many problems. This data set has a class-imbalance problem, and when you have class imbalance with very few samples, using machine learning is tricky. SVM is very sensitive to class imbalance: if you use a single SVM, you need to change the misclassification cost for the different classes.
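With fitcsvm, that per-class cost can be set through the 'Cost' argument. A sketch, where the 'neg'/'pos' label names and the roughly 19:1 cost ratio (285/15) are illustrative assumptions:

% Rows of the cost matrix are true classes, columns are predicted classes,
% in the order given by 'ClassNames': misclassifying a true 'pos' sample
% costs 19 times more than misclassifying a true 'neg' sample.
costMatrix = [0 1; 19 0];
svmModel = fitcsvm(X, Y, ...
    'ClassNames', {'neg', 'pos'}, ...
    'Cost', costMatrix, ...
    'KernelFunction', 'rbf', ...
    'Standardize', true);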

If you use MATLAB's built-in functions you will not have full control over this. Note also that your problem is multivariate (18 features).

Start with a bagging technique: the base learners can be SVMs, with down-sampling of the majority class. Use all samples from the minority class and 15 samples from the majority class. The majority-class samples can be randomly re-sampled each time you train a base learner.

The number of base learners can be: (number of samples in the majority class) / (number of samples in the minority class).

For testing: classify the sample with all base learners and average their outputs.
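A sketch of this bagged-SVM scheme with majority-class under-sampling (X, Y, Xtest and the 'pos'/'neg' label names are assumptions, not from the original answer):

posIdx = find(strcmp(Y, 'pos'));                     % 15 minority samples
negIdx = find(strcmp(Y, 'neg'));                     % 285 majority samples
numLearners = floor(numel(negIdx) / numel(posIdx));  % 285/15 = 19 base learners

models = cell(numLearners, 1);
for k = 1:numLearners
    % Fresh random subset of the majority class for every base learner
    sampled = negIdx(randperm(numel(negIdx), numel(posIdx)));
    idx = [posIdx; sampled];
    models{k} = fitcsvm(X(idx,:), Y(idx), 'Standardize', true);
end

% Testing: average the SVM scores of all base learners, then threshold
scores = zeros(size(Xtest, 1), numLearners);
for k = 1:numLearners
    [~, s] = predict(models{k}, Xtest);
    scores(:, k) = s(:, 2);   % score column of the second class in ClassNames
end
predictedPositive = mean(scores, 2) > 0;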

If the accuracy is not high, it means the classification problem is difficult, and it is better to use AdaBoost. There is a good article about using AdaBoost:

Viola-Jones AdaBoost

This AdaBoost variant is very good: it takes care of both the class imbalance and the feature selection.

I think you should try to get at least something like 100 observations of each class if possible. That would also make the hyperparameter optimization more robust, and would help you find how many learners are suitable and which method is best. From my limited experience, though, it does not make a huge difference which of these ensemble methods you use. If you can't get more data, you could loop over the number of learners (10:5:300) and take the mean classification accuracy over 100 repetitions of random under-sampling of the majority class.

Here is some example code that you can build a loop around (using fitcensemble, available in R2016b or later):

switch classifierParameters.method{1}
case 'Bag'
    t = templateTree(   'MinLeafSize', classifierParameters.minLeafSize, ...
                        'MaxNumSplits', classifierParameters.maxNumSplits, ...
                        'SplitCriterion', classifierParameters.splitCriterion{1}, ...
                        'NumVariablesToSample', classifierParameters.numVariablesToSample);

    classificationEnsemble = fitcensemble(...
        predictors, ...
        response, ...
        'Learners', t, ...
        'Method', classifierParameters.method{1}, ...
        'NumLearningCycles', classifierParameters.numLearningCycles, ...
        'KFold',7); 

case {'AdaBoostM1','GentleBoost','LogitBoost'} 
    t = templateTree(  'MaxNumSplits', classifierParameters.maxNumSplits,...
                        'MinLeafSize', classifierParameters.minLeafSize);  
                        % Always 'SplitCriterion', 'mse' for Boosting

    classificationEnsemble = fitcensemble(...
        predictors, ...
        response, ...
        'Learners', t, ...
        'Method', classifierParameters.method{1}, ...
        'NumLearningCycles',classifierParameters.numLearningCycles,...
        'KFold',7,...
        'LearnRate',classifierParameters.learnRate);

case 'OptimizeHyperparameters'
    strct = struct( 'KFold', 10, 'Verbose',1, 'MaxObjectiveEvaluations',1000, 'SaveIntermediateResults', true, ...
                    'Repartition',false);

    classificationEnsemble = fitcensemble(...
        predictors, ...
        response, ...
        'OptimizeHyperparameters', 'all',... {'Method', 'LearnRate', 'MinLeafSize','MaxNumSplits','SplitCriterion', 'NumVariablesToSample'},...
        'HyperparameterOptimizationOptions', strct);

otherwise 
    error('Classification method not recognized')
end
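One possible outer loop around that switch, implementing the suggested search over ensemble sizes with repeated random under-sampling (a sketch: the 'pos'/'neg' labels and all variable names except predictors/response are made up, and the fit is shown inline with 'LogitBoost' rather than going through the switch):

cycleGrid = 10:5:300;                     % candidate numbers of learners
meanAcc = zeros(size(cycleGrid));
posIdx = find(strcmp(response, 'pos'));   % minority class (assumed label)
negIdx = find(strcmp(response, 'neg'));   % majority class (assumed label)

for c = 1:numel(cycleGrid)
    acc = zeros(100, 1);
    for rep = 1:100
        % Random under-sampling: all positives plus an equal-sized
        % random draw from the negatives
        keepNeg = negIdx(randperm(numel(negIdx), numel(posIdx)));
        idx = [posIdx; keepNeg];
        mdl = fitcensemble(predictors(idx,:), response(idx), ...
            'Method', 'LogitBoost', ...
            'NumLearningCycles', cycleGrid(c), ...
            'KFold', 7);
        acc(rep) = 1 - kfoldLoss(mdl);
    end
    meanAcc(c) = mean(acc);
end

[bestAcc, best] = max(meanAcc);
fprintf('Best mean accuracy %.3f at %d learners\n', bestAcc, cycleGrid(best));

Note that this runs 100 x numel(cycleGrid) cross-validated fits, so expect it to be slow; reducing the repetition count while prototyping is sensible.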
