Adding class weights for an imbalanced dataset in a convolutional neural network

Question

I have a dataset of images that has the following distribution:

Class 0: 73.5%
Class 1: 7%
Class 2: 15%
Class 3: 2.5%
Class 4: 2%

I think I need to add class weights to make up for the small number of images in classes 1, 2, 3 and 4.

I have tried calculating the class weights by dividing the share of class 0 by the share of class 1, the share of class 0 by the share of class 2, and so forth.

I'm assuming that class 0 gets a weight of 1, as it doesn't need to be scaled? Not sure if that is correct, though.

class_weights = np.array([1, 10.5, 4.9, 29.4, 36.75])

and added them to my fit function:

model.fit(x_train, y_train, batch_size=batch_size, class_weight=class_weights, epochs=epochs, validation_data=(x_test, y_test))

I'm unsure whether I have calculated the weights correctly, and whether this is even how it is supposed to be done.

Hopefully someone can help clarify this.

Answer 1

First of all, make sure to pass a dictionary, since the class_weight parameter takes a dictionary.

Second, the point of weighting the classes is as follows. Let's say that you have a binary classification problem where class_1 has 1000 instances and class_2 has 100 instances. Since you want to make up for the imbalanced data, you can set the weights as:

class_weights={"class_1": 1, "class_2": 10}

In other words, if the model makes a mistake on a sample whose true label is class_2, it is penalized 10 times more than if it makes a mistake on a sample whose true label is class_1. You want something like this because, given the class distribution in the data, the model has an inherent tendency to overfit to class_1, since that class is overrepresented. By setting the class weights you impose an implicit constraint on the model: making wrong predictions on 10 instances of class_1 is as bad as making 1 wrong prediction on an instance of class_2.
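To make the 10-to-1 penalty concrete, here is a minimal sketch in plain Python. It assumes the weight simply multiplies the per-sample cross-entropy loss; the function weighted_loss, the integer class indices and the probabilities are made up for illustration and are not part of any library.

```python
import math

# Hypothetical weights for the example above, keyed by integer class index:
# index 0 stands for class_1 (1000 instances), index 1 for class_2 (100 instances).
class_weight = {0: 1.0, 1: 10.0}

def weighted_loss(true_class, predicted_probs, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[true_class] * math.log(predicted_probs[true_class])

# The same kind of mistake: only 20% probability assigned to the true class.
loss_on_class_1 = weighted_loss(0, [0.2, 0.8], class_weight)
loss_on_class_2 = weighted_loss(1, [0.8, 0.2], class_weight)

# The mistake on the rare class costs about 10 times more.
ratio = loss_on_class_2 / loss_on_class_1
print(round(ratio, 6))
```

Summed over a dataset, ten such mistakes on class_1 then contribute as much loss as one on class_2, which is the constraint described above.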

With that said, you can set the class weights however you want; there is no single right or wrong way to do it. The way you set the weights seems reasonable to me.
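For reference, the ratio approach described in the question can be sketched as follows. This is only one possible heuristic, not the one correct formula; the variable names are illustrative and the percentages are the ones given in the question.

```python
# Class shares in percent, taken from the question.
class_share = {0: 73.5, 1: 7.0, 2: 15.0, 3: 2.5, 4: 2.0}

# Weight each class by how much smaller it is than the largest class,
# so the largest class gets weight 1.
largest = max(class_share.values())
class_weight = {cls: round(largest / share, 2) for cls, share in class_share.items()}

print(class_weight)
```

The result matches the values in the question's array (1, 10.5, 4.9, 29.4, 36.75), but as a dictionary keyed by class index.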

Answer 2

Please see this answer for a proper solution: https://datascience.stackexchange.com/a/18722

I understand that you are trying to set class weights, but also consider image augmentation to generate more images for the underrepresented classes.

Answer 3

I solved the problem, thank you so much, gorjan.

class_weight = {0: 1.0,
            1: 10.5,
            2: 4.8,
            3: 29.5,
            4: 36.4}

What did the trick was writing the class keys without quotes (0 and 1 rather than "0" and "1") :-) and using a dict, as you suggested, instead of the np array.
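If the weights already exist as an array or list in class-index order, a small sketch like the one below turns them into a dict with integer keys. The values are the ones from the question's array; the fit call is shown only as a comment and assumes the same model and data as in the question.

```python
# Weights in class-index order, as in the question's array.
weights = [1, 10.5, 4.9, 29.4, 36.75]

# enumerate() yields (index, weight) pairs, so the keys are plain ints
# (0, 1, 2, ...) rather than strings such as "0" or "1".
class_weight = {index: float(w) for index, w in enumerate(weights)}

print(class_weight)

# Assumed usage, mirroring the call in the question (not executed here):
# model.fit(x_train, y_train, batch_size=batch_size,
#           class_weight=class_weight, epochs=epochs,
#           validation_data=(x_test, y_test))
```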

3 answers

Answer 1    8    2018-12-20 00:15:20
Answer 2    2    2020-03-17 06:00:50
Answer 3    0    2018-12-20 01:38:25
