
Deep Learning an Imbalanced Data Set

I have two data sets that look like this:

DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)

DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)

I am trying to build a deep feedforward neural net in TensorFlow. I get accuracies in the 90s and AUC scores in the 80s. Of course, the data set is heavily imbalanced, so those metrics are useless. My emphasis is on getting a good recall value, and I do not want to oversample Class 1. I have toyed with the complexity of the model to no avail; the best model predicted only 25% of the positive class correctly.

My question is: given the distribution of these data sets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?

Thanks!

Question

Can I use TensorFlow to learn an imbalanced classification problem with a class ratio of about 30:1?

Answer

Yes, and I have. Specifically, TensorFlow lets you feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy; it has a weights parameter. You can feed in a matrix that matches Y in shape and, for each value of Y, provides the relative weight that training example should have.
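To make the effect of a per-example weight concrete, here is a minimal NumPy sketch of a weighted sigmoid cross-entropy. It is not TensorFlow's implementation: the function name, the plain-mean reduction, and the 3x weight on Class 1 are assumptions chosen only for illustration.

```python
import numpy as np

def weighted_sigmoid_cross_entropy(labels, logits, weights):
    """Mean of the per-example sigmoid cross-entropy, each term scaled by its weight.

    Illustrative only: the reduction here is a plain mean, which is an assumption
    and may differ from the reduction a given TensorFlow loss applies.
    """
    labels = np.asarray(labels, dtype=float)
    logits = np.asarray(logits, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Numerically stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    per_example = (np.maximum(logits, 0.0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(weights * per_example))

# One positive among three negatives; the model leans towards class 0 everywhere.
labels = np.array([[1.0], [0.0], [0.0], [0.0]])
logits = np.array([[-2.0], [-2.0], [-2.0], [-2.0]])

uniform = np.ones_like(labels)
upweighted = np.where(labels == 1.0, 3.0, 1.0)  # hypothetical 3x weight on class 1

loss_uniform = weighted_sigmoid_cross_entropy(labels, logits, uniform)
loss_upweighted = weighted_sigmoid_cross_entropy(labels, logits, upweighted)
```

With the larger weight on the positive example, the missed positive contributes more to the loss, so the optimizer is pushed harder to get Class 1 right.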

One way to find the right weights is to start with different balances, run your training, and then look at your confusion matrix and a rundown of precision vs accuracy for each class. Once both classes have about the same precision-to-accuracy ratio, they are balanced.
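The per-class check described above could be sketched as follows. This assumes binary labels and reads "accuracy for each class" as the fraction of that class's examples predicted correctly (i.e. recall); that reading, the function name, and the toy predictions are all assumptions.

```python
import numpy as np

def per_class_report(y_true, y_pred):
    """Return a 2x2 confusion matrix (rows: true class, columns: predicted class)
    and per-class precision and recall for binary labels."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    confusion = np.zeros((2, 2), dtype=int)
    np.add.at(confusion, (y_true, y_pred), 1)  # count each (true, predicted) pair

    report = {}
    for c in (0, 1):
        predicted_as_c = confusion[:, c].sum()
        actually_c = confusion[c, :].sum()
        precision = confusion[c, c] / predicted_as_c if predicted_as_c else 0.0
        recall = confusion[c, c] / actually_c if actually_c else 0.0
        report[c] = {"precision": float(precision), "recall": float(recall)}
    return confusion, report

# Toy predictions, made up for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
confusion, report = per_class_report(y_true, y_pred)
print(confusion)
print(report)
```

Rerunning this after each training run with a different class balance gives the per-class numbers to compare.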

Example Implementation

Here is an example implementation that converts Y into a weight matrix; it has performed very well for me:

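import numpy as np

def weightMatrix( matrix , most=0.9 ) :
    # b: per-column fraction of positive labels, clipped to [1 - most, most]
    b = np.maximum( np.minimum( most , matrix.mean(0) ) , 1. - most )
    a = 1./( b * 2. )  # weight given to each positive label
    # negatives get a * b / (1 - b), so both classes carry equal total weight while b is unclipped
    weights = a * ( matrix + ( 1 - matrix ) * b / ( 1 - b ) )
    return weights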

The most parameter represents the largest fractional difference to consider. 0.9 equates to .1:.9 = 1:9, whereas .5 equates to 1:1. Values below .5 don't work.
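As a quick check of what the function and its most parameter do, the sketch below repeats weightMatrix and applies it to two made-up label columns. The label arrays are assumptions for illustration, not the data sets from the question.

```python
import numpy as np

def weightMatrix(matrix, most=0.9):
    # Same computation as the implementation above, repeated so this runs on its own.
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    return a * (matrix + (1 - matrix) * b / (1 - b))

# Case 1: 1 positive in 4 examples. The fraction 0.25 lies inside [0.1, 0.9], so no clipping.
y_mild = np.array([[1.], [0.], [0.], [0.]])
w_mild = weightMatrix(y_mild)
pos_total = w_mild[y_mild == 1].sum()  # total weight carried by the positive class
neg_total = w_mild[y_mild == 0].sum()  # total weight carried by the negative class

# Case 2: 1 positive in 20 examples. The fraction 0.05 is clipped up to 1 - most = 0.1.
y_skewed = np.zeros((20, 1))
y_skewed[0, 0] = 1.
w_skewed = weightMatrix(y_skewed)
ratio = w_skewed[0, 0] / w_skewed[1, 0]  # positive weight divided by negative weight

print(pos_total, neg_total, ratio)
```

In the first case both classes end up with the same total weight. In the second, the positive-to-negative weight ratio is capped at 9:1 by most=0.9, even though the raw imbalance is 19:1.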

You might be interested in this question and its answer. Its scope is a priori more restricted than yours, as it specifically addresses weights for classification, but it seems very relevant to your case.

Also, AUC is definitely not irrelevant: it is actually independent of your data imbalance.
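A small sketch of why the class ratio itself does not move ROC AUC: computed as the probability that a random positive outscores a random negative, it stays the same when the negatives are replicated. The scores are made up, and the function name is an assumption. This only shows insensitivity to the ratio when the score distributions are unchanged; it says nothing about how noisy the estimate is with very few positives.

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

pos = [0.9, 0.6, 0.4]        # model scores for class 1 examples (made up)
neg = [0.7, 0.3, 0.2, 0.1]   # model scores for class 0 examples (made up)

auc_balanced = roc_auc(pos, neg)
auc_imbalanced = roc_auc(pos, neg * 30)  # same negatives repeated 30 times

print(auc_balanced, auc_imbalanced)
```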


 