Oversampling image data for Keras

Question

I am working on a Kaggle competition and trying to solve a multilabel classification problem with Keras.

My dataset is highly imbalanced. I am familiar with oversampling and have done it for simple machine learning datasets, but I am not sure how to deal with both images and CSV data.

There are a couple of related questions, but they did not help me:

Use SMOTE to oversample image data

How to oversample image dataset using Python?

Class
No finding            25462
Aortic enlargement     5738
Cardiomegaly           4345
Pleural thickening     3866
Pulmonary fibrosis     3726
Nodule/Mass            2085
Pleural effusion       1970
Lung Opacity           1949
Other lesion           1771
Infiltration            997
ILD                     792
Calcification           775
Consolidation           441
Atelectasis             229
Pneumothorax            185

I am trying to do oversampling, but I am not sure how to approach it. I have 15000 PNG images and a train.csv dataset, which looks like:

   image_id                          class_name          class_id  rad_id  x_min   y_min   x_max   y_max   width  height
0  50a418190bc3fb1ef1633bf9678929b3  No finding          14        R11     0.0     0.0     0.0     0.0     2332   2580
1  21a10246a5ec7af151081d0cd6d65dc9  No finding          14        R7      0.0     0.0     0.0     0.0     2954   3159
2  9a5094b2563a1ef3ff50dc5c7ff71345  Cardiomegaly        3         R10     691.0   1375.0  1653.0  1831.0  2080   2336
3  051132a778e61a86eb147c7c6f564dfe  Aortic enlargement  0         R10     1264.0  743.0   1611.0  1019.0  2304   2880
4  063319de25ce7edb9b1c6b8881290140  No finding          14        R10     0.0     0.0     0.0     0.0     2540   3072

How should I attack this problem when I have both images and a CSV file?

After I converted the data, it looks like:

                               Images               Class
56     d106ec9b305178f3da060efe3191499a.png         Nodule/Mass
38694  081d1700020b6bf0099f1e4d8aeec0f3.png        Lung Opacity
50141  ff8ef73390f04480aba0be7810ef94cf.png          No finding
233    253d35b7096d0957bd79cfb4b1c954e1.png          No finding
2166   1951e0eba7c68aa1fbd6d723f19ee7c4.png  Pleural thickening

I use an image generator:

# Create a train generator
train_generator = train_dataGen.flow_from_dataframe(dataframe = train,
                                                directory = 'my_directory', 
                                                x_col = 'Images',
                                                y_col = 'Class',
                                                class_mode = 'categorical',
                                                # target_size = (256, 256),
                                                batch_size = 32)

I tried something dumb, which obviously did not work:

# Create an instance
oversample = SMOTE()

# Oversample
train_ovsm, valid_ovsm = oversample.fit_resample(train_ovsm, valid_ovsm)

It gives me an error:

ValueError: could not convert string to float: '954984f75efe6890cfa45d0784a3a1e6.png'

I would appreciate tips and good tutorials; I cannot find anything so far.

Answer 1

I'm not sure whether this answer will satisfy you, but here are my thoughts. If I were you, I wouldn't try to balance the data the way you're trying to now. IMO, that's not the proper way. Your main concern is that this VinBigData dataset is highly imbalanced and you're not sure how to address that properly.
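For what it's worth, the ValueError in the question happens because SMOTE interpolates between numeric feature vectors, so it cannot work on a column of file names. If you still want plain oversampling, a simpler option is to repeat minority-class rows in the dataframe before handing it to flow_from_dataframe. Below is a minimal sketch, assuming the Images and Class columns from the question; the function name oversample_rows and the toy dataframe are made up for illustration. Apply it to the training split only, after the train/validation split, so that duplicated images do not leak into validation.

```python
import pandas as pd

def oversample_rows(df, label_col="Class", random_state=0):
    """Repeat minority-class rows until every class is as large as the biggest one.

    Hypothetical helper: it only duplicates dataframe rows (file names),
    it does not create new images.
    """
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        extra = target - len(group)
        if extra > 0:
            # Sample with replacement: the same image file is listed more than once.
            repeats = group.sample(n=extra, replace=True, random_state=random_state)
            group = pd.concat([group, repeats])
        parts.append(group)
    # Shuffle so that the duplicates are not clustered together.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

# Toy stand-in for the converted dataframe shown in the question.
train = pd.DataFrame({
    "Images": ["a.png", "b.png", "c.png", "d.png", "e.png"],
    "Class": ["No finding", "No finding", "No finding", "Nodule/Mass", "Lung Opacity"],
})
balanced = oversample_rows(train)
```

The balanced dataframe can then be passed as the dataframe argument of flow_from_dataframe. Combined with random augmentation, the repeated rows will not produce identical inputs every epoch.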

Here are some first approaches that most people would adopt to address this issue in this competition.

- External dataset 
- Heavy and meaningful augmentation
- Modified loss function

External Datasets

NIH Chest X-rays: Data
SIIM-ACR Pneumothorax Segmentation: Data
OSIC Pulmonary Fibrosis Progression: Data
RSNA Pneumonia Detection Challenge: Data
Chest X-Ray Images (Pneumonia): Data

What you need to do is collect all the usable external samples from these datasets, combine them, and build a new dataset. It may take time, but it is worth it.

Medical Image Augmentation

We all know augmentation is one of the key strategies for training deep learning models, but it matters that you choose the right augmentation. Here are some demonstrations. The main intuition is to avoid destroying diagnostically sensitive information, so be careful with that.
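As one illustration of "mild" augmentation, here is a minimal sketch of a small brightness and contrast jitter written in NumPy. The function name and the jitter ranges are my own assumptions, not values tuned for this competition, and it assumes pixel values in the 0 to 255 range. Keras' ImageDataGenerator accepts such a function through its preprocessing_function argument, which receives one rank-3 image array and must return an array of the same shape.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; drop the seed for real training.
rng = np.random.default_rng(0)

def mild_intensity_jitter(image, max_contrast=0.1, max_brightness=10.0):
    """Apply a small random contrast and brightness change to one image.

    Hypothetical helper; the default ranges are illustrative assumptions.
    Assumes pixel values in [0, 255].
    """
    image = np.asarray(image, dtype=np.float32)
    contrast = 1.0 + float(rng.uniform(-max_contrast, max_contrast))
    brightness = float(rng.uniform(-max_brightness, max_brightness))
    mean = image.mean()
    # Scale around the mean so the contrast change does not shift overall brightness.
    out = (image - mean) * contrast + mean + brightness
    return np.clip(out, 0.0, 255.0).astype(np.float32)

# Possible use with the generator from the question (not executed here):
# train_dataGen = ImageDataGenerator(preprocessing_function=mild_intensity_jitter)
```

Geometric transforms deserve the same caution. A horizontal flip, for example, swaps the left and right sides of the chest, which may or may not be acceptable for the findings you care about.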

Class Loss Weighting

You can modify the loss function so that each class's contribution is weighted, for example giving rarer classes a larger weight. Here is a detailed explanation of this topic.
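A minimal sketch of one common weighting scheme, inverse class frequency, using the class counts from the question. The helper name balanced_class_weights is made up for illustration, and the class_indices dictionary below is only a stand-in: with a real generator, read the mapping from train_generator.class_indices instead of assuming an order.

```python
# Annotation counts per class, copied from the question.
counts = {
    "No finding": 25462, "Aortic enlargement": 5738, "Cardiomegaly": 4345,
    "Pleural thickening": 3866, "Pulmonary fibrosis": 3726, "Nodule/Mass": 2085,
    "Pleural effusion": 1970, "Lung Opacity": 1949, "Other lesion": 1771,
    "Infiltration": 997, "ILD": 792, "Calcification": 775,
    "Consolidation": 441, "Atelectasis": 229, "Pneumothorax": 185,
}

def balanced_class_weights(counts, class_indices):
    """Weight each class by total / (n_classes * class_count).

    Hypothetical helper: rare classes get weights above 1,
    frequent classes get weights below 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {
        class_indices[name]: total / (n_classes * count)
        for name, count in counts.items()
    }

# Stand-in mapping from class name to label index (an assumption for this sketch).
class_indices = {name: i for i, name in enumerate(sorted(counts))}
class_weight = balanced_class_weights(counts, class_indices)

# Possible use with Keras (not executed here):
# model.fit(train_generator, class_weight=class_weight)
```

With these weights, every class contributes the same total weight to the loss, so "Pneumothorax" samples count far more individually than "No finding" samples.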

Answer 1 accepted 2021-02-28 09:35:40
