
Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...

As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, =or=, more likely, I've done something wrong on the SVM side.

I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?

The details:

  • About ~2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
  • Ran subset.py to separate the data into test (500 instances) and train (remaining) sets.
  • Ran "svm-train -t 0". (Note: apparently no need for '-w1 1 -w-1 4'...)
  • Ran svm-predict on the test file. Accuracy = 100%!

Things tried:

  • Checked about 10 times that I'm not training & testing on the same data files through some inadvertent command-line argument error.
  • Re-ran subset.py (even with -s 1) multiple times and trained/tested on multiple different data sets (in case I randomly hit upon the most magical train/test partition).
  • Ran a simple diff-like check to confirm that the test file is not a subset of the training data.
  • svm-scale on the data has no effect on accuracy (accuracy = 100%). (Although the number of support vectors does drop from nSV = 127, nBSV = 64 to nSV = 72, nBSV = 0.)
  • ((weird)) Using the default RBF kernel (vice linear--ie, removing '-t 0') results in accuracy going to garbage (?!)
  • (sanity check) Running svm-predict with a model trained on a scaled data set against an unscaled data set gives accuracy = 80% (ie, it always guesses the dominant class). This is strictly a sanity check to make sure that svm-predict is nominally acting right on my machine.
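The "diff-like check" in the list above can be sketched in a few lines of Python. The file names are placeholders for whatever subset.py produced; the only assumption is that both files are in LIBSVM's plain-text format, one instance per line.

```python
# Sketch of the diff-like overlap check: confirm that no test instance
# also appears verbatim in the training file. Whitespace is normalized
# so that cosmetic spacing differences don't hide a duplicate.

def load_instances(path):
    """Read a LIBSVM-format file as a set of whitespace-normalized lines."""
    with open(path) as f:
        return {" ".join(line.split()) for line in f if line.strip()}

def overlap(train_path, test_path):
    """Return the set of instances that appear in both files."""
    return load_instances(train_path) & load_instances(test_path)
```

Usage would be something like `print(len(overlap("train.txt", "test.txt")))`; any nonzero count means the split leaks. Note this only catches byte-identical vectors, not near-duplicates such as adjacent frames of the same video.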

Tentative conclusion?:

Something with the data is whacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.

(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)

Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) how to determine what subtle experimenter bias in the data LibSVM is picking up on.

Two other ideas:

Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care: make sure you're not repeating data (say, two frames of the same video falling on different folds), and that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
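A subject-wise split can be sketched with scikit-learn's GroupShuffleSplit, which keeps every frame of a given person on the same side of the train/test boundary. Assumptions here: scikit-learn is available, and you can build a `groups` array of person IDs, one per frame; the data below is synthetic stand-in for the real feature matrix.

```python
# Sketch: split by person, not by frame, so no individual appears in
# both train and test. Shapes mirror the question (~2500 x ~900, <20 people).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(2500, 900))         # placeholder feature matrix
y = rng.integers(0, 2, size=2500)        # placeholder binary labels
groups = rng.integers(0, 20, size=2500)  # person ID for each frame (assumed available)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

# By construction, no person contributes frames to both sides:
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

If accuracy collapses under a person-disjoint split, the 100% was the SVM recognizing individuals (or recording conditions), not expressions.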

Make sure you search for the gamma and C parameters of the RBF kernel. There are good theoretical (asymptotic) results justifying that a linear classifier is just a degenerate RBF classifier, so you should just look for a good (C, gamma) pair.
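LibSVM ships a grid-search tool (tools/grid.py) for exactly this; the same log2-spaced search can be sketched with scikit-learn's GridSearchCV, which wraps libsvm under the hood. The dataset and the exponent grid below are illustrative, not a recommendation.

```python
# Sketch of a cross-validated (C, gamma) search for the RBF kernel,
# on a small synthetic problem with a roughly linear boundary.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable signal + noise dims

param_grid = {
    "C":     [2.0**k for k in range(-5, 16, 4)],   # coarse log2 grid, like grid.py
    "gamma": [2.0**k for k in range(-15, 4, 4)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print("best:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```

This also explains the "RBF gives garbage" observation: with an unsuitable default gamma on ~900 unscaled features, the RBF kernel degenerates, while a well-chosen small gamma behaves like the linear kernel that worked.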

Notwithstanding that the devil is in the details, here are three simple tests you could try:

  1. Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load the data into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
  2. Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (ie, from 900 into 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
  3. Deeper analysis (minutes to several hours): Try permutations of the labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size; if you get separability even with a very small training set, then something is weird.
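Test #1 can equally be sketched in Python with scikit-learn's DecisionTreeClassifier instead of Matlab/R. The "leaky" feature below is planted deliberately so the symptom is visible: a shallow tree hits perfect accuracy and its feature importances finger the culprit.

```python
# Sketch of the quickie decision-tree check. Data is synthetic; feature 7
# is deliberately made a near-copy of the label to mimic a leak.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 2, size=500)
X[:, 7] = y + rng.normal(scale=0.01, size=500)  # planted leak in feature 7

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
acc = tree.score(X, y)
top = int(np.argmax(tree.feature_importances_))
print(f"training accuracy {acc:.3f}, dominated by feature {top}")
```

On real data, a single feature dominating the importances (with near-perfect accuracy at depth 1-3) is the signature of an experimenter-driven artifact.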

Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy, and it would be odd if they didn't give any insights.
