
Suggestions to improve my normalized accuracy with libsvm

I have a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do a grid search for the SVM parameters and train my data with class weights, testing gives an accuracy of 96.8113%. But because the test data is unbalanced, all the correctly predicted values are from the negative class, which is much larger than the positive class.

I tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes both the positive and the negative class into account) is lower with each try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low accuracy (18.4234%).
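The "normalized accuracy" described here is usually computed as the average of the per-class recalls (often called balanced accuracy), which is why a majority-class classifier scores high on plain accuracy but poorly on it. A minimal sketch in plain Python, with illustrative label lists (in libsvm itself, the per-class costs mentioned above are set with `svm-train`'s `-w1` and `-w-1` options, which scale C for that class):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: (TPR + TNR) / 2 for labels in {1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == -1)
    tpr = tp / pos if pos else 0.0  # recall on the positive class
    tnr = tn / neg if neg else 0.0  # recall on the negative class
    return (tpr + tnr) / 2

# A classifier that always predicts the majority (negative) class:
y_true = [1] * 2 + [-1] * 98
y_pred = [-1] * 100
print(balanced_accuracy(y_true, y_pred))  # 0.5, despite 98% plain accuracy
```

This makes the asker's situation concrete: 96.8% plain accuracy with zero positive hits collapses to 0.5 balanced accuracy, no better than chance.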

I want to know whether the problem is in my description (how I build the feature vectors), in the imbalance (should I balance the data in another way?), or whether I should change my classifier.

Better data always helps.

I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Measuring plain accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50%/50% and testing on data that is distributed 99% vs 1%.

There are problems in real life that are like the one you're studying (that have a great imbalance of positives to negatives). Let me give you two examples:

  • Information retrieval: given all documents in a huge collection, return the subset that is relevant to search term q.

  • Face detection: given a large image, mark all locations where there are human faces.

Many approaches to these types of systems are classifier-based. To evaluate and compare classifiers, a few tools are commonly used: ROC curves, precision-recall curves, and the F-score. These tools give a more principled way to judge when one classifier is working better than another.
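For a single operating point, precision, recall, and the F-score come directly from the confusion counts, and none of them reward a classifier for getting the dominant negative class right. A minimal sketch in plain Python, with illustrative labels:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive class only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1, -1, -1, -1, -1]
print(precision_recall_f1(y_true, y_pred))
```

Note that true negatives never appear in these formulas, so a classifier that only predicts the majority class scores an F1 of 0 no matter how unbalanced the data is; ROC and precision-recall curves generalize this by sweeping the decision threshold.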
