Reducing false positives in a CNN (Conv1D) text classification model

I created a char-based CNN model for text classification with Keras + TensorFlow, using mainly Conv1D, based largely on:

http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

The model performs very well, with 80%+ accuracy on the test data set. However, I'm having a problem with false positives. One reason could be that the final layer is a Dense layer with a softmax activation function.

To give an idea of how the model performs: I train it on a data set with 31 classes and 1021 samples, and accuracy is ~85% on a 25% held-out test set.

However, if you count false positives the performance is pretty bad (I didn't build a separate test set for this, since it's obvious just from testing by hand): every input gets a corresponding prediction. For example, the sentence acasklncasdjsandjas can be classified as ask_promotion.
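
This behaviour is inherent to softmax. A small numpy illustration (the logit values here are made up): even when no class really fires, the outputs are normalized into a distribution that sums to 1, so argmax always names a winner.

    import numpy as np

    def softmax(logits):
        exps = np.exp(logits - np.max(logits))  # shift for numerical stability
        return exps / exps.sum()

    # Hypothetical weak logits for a garbage input such as "acasklncasdjsandjas":
    # none of them is strong, yet softmax still yields a full distribution.
    logits = np.array([0.30, 0.10, 0.20, 0.25, 0.15])
    probs = softmax(logits)
    print(probs.sum())     # sums to 1 (up to float rounding)
    print(probs.argmax())  # class 0 "wins" even though nothing matched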

Are there any best practices for dealing with false positives in this case? My ideas are:

  1. Implement a noise class whose samples are just sets of totally random text. However, this doesn't seem to help, since the noise contains no pattern and would therefore be hard for the model to learn.
  2. Replace softmax with something that doesn't force the output probabilities to sum to 1, so that small values can stay small regardless of the other values (a rough sketch follows this list). I did some research on this, but there isn't much information on changing the activation function for this specific case.
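
Here is a rough sketch of what I have in mind for idea 2, assuming a tf.keras model (the vocabulary size, sequence length, layer sizes and the 0.5 threshold are placeholders, not my real values): per-class sigmoids are independent, so they can all stay low for a garbage input, and the prediction can be rejected when none of them crosses the threshold.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    num_classes = 31
    vocab_size = 70    # placeholder: size of the character vocabulary
    seq_len = 200      # placeholder: padded input length

    model = keras.Sequential([
        layers.Embedding(vocab_size, 16),
        layers.Conv1D(128, 5, activation='relu'),
        layers.GlobalMaxPooling1D(),
        # sigmoid instead of softmax: the class probabilities no longer
        # have to sum to 1, so they can all be small at the same time
        layers.Dense(num_classes, activation='sigmoid'),
    ])
    # binary_crossentropy treats each output unit as its own yes/no decision
    model.compile(optimizer='adam', loss='binary_crossentropy')

    # at prediction time, reject inputs no class is confident about
    probs = model.predict(np.zeros((1, seq_len), dtype='int32'))  # dummy input
    if probs.max() < 0.5:  # threshold is a tunable assumption
        print('no class detected')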

That sounds like an issue of imbalanced data, where two classes have completely different support (the number of instances in each class). This issue is particularly crucial in hierarchical classification, where some classes with a deep hierarchy tend to have many more instances than the others.

Anyway, let's simplify the issue to binary classification, and name the class with much more support Class-A and the one with less support Class-B. Generally speaking, there are two popular ways to circumvent this issue.

  1. Under-sampling: keep Class-B as is, then sample the same number of instances from Class-A as there are in Class-B. Combine these instances and train your classifier on them.

  2. Over-sampling: keep Class-A as is, then sample instances from Class-B (with replacement, since it has fewer) until you match the size of Class-A. The rest goes as in choice 1. A minimal numpy sketch of both options follows.
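
The class labels and counts below are made up for illustration; shuffle the result before training:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: 900 instances of Class-A (label 0), 100 of Class-B (label 1)
    X = rng.random((1000, 10))
    y = np.array([0] * 900 + [1] * 100)

    idx_a = np.where(y == 0)[0]
    idx_b = np.where(y == 1)[0]

    # 1. Under-sampling: draw |Class-B| instances from Class-A, no replacement
    keep_a = rng.choice(idx_a, size=len(idx_b), replace=False)
    idx_under = np.concatenate([keep_a, idx_b])

    # 2. Over-sampling: draw |Class-A| instances from Class-B, with replacement
    grow_b = rng.choice(idx_b, size=len(idx_a), replace=True)
    idx_over = np.concatenate([idx_a, grow_b])

    X_bal, y_bal = X[idx_under], y[idx_under]  # or idx_over for option 2
    print(np.bincount(y_bal))                  # both classes now have 100 instances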

For more information, please refer to this KDNuggets page:

https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html

Hope this helps. :P
