Dealing with highly imbalanced datasets using TensorFlow Dataset and Keras Tuner

I have a highly imbalanced dataset (3% Yes, 87% No) of textual documents, each with a title feature and an abstract feature. I have transformed these documents into tf.data.Dataset entities with padded batches. Now I am trying to train a deep learning model on this dataset. With model.fit() in TensorFlow, you have the class_weight parameter to deal with class imbalance. However, I am searching for the best hyperparameters using the keras-tuner library, and its hyperparameter tuners do not appear to have such an option. Therefore, I am looking for other ways to deal with the class imbalance.

Is there an option to use class weights in keras-tuner? To add, I am already using the precision@recall metric. I could also try a data resampling method, such as imblearn.over_sampling.SMOTE, but as this Kaggle post mentions:

It appears that SMOTE does not help improve the results. However, it makes the network learn faster. Moreover, there is one big problem: this method is not compatible with larger datasets. You have to apply SMOTE on embedded sentences, which takes way too much memory.
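On the class-weight question itself, a minimal sketch of how such weights could be derived from label counts is given here. It uses the common "balanced" heuristic, n_samples / (n_classes * class_count), and the toy labels (3 positives, 87 negatives) are hypothetical stand-ins for the proportions quoted in the question. The resulting dict has the shape that the class_weight argument of model.fit() expects. The keras-tuner documentation describes search() as forwarding its arguments to model.fit(), so passing the dict there may work, but treat that as an assumption to verify against your installed version.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 3 positive documents, 87 negative documents.
labels = [1] * 3 + [0] * 87
class_weight = balanced_class_weights(labels)

# The rare class receives the larger weight.
print(class_weight)
```

With these weights, each class contributes the same total weight to the loss, so errors on the rare class are no longer drowned out by the majority class.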

If you are looking for other methods to deal with imbalanced data, you may consider generating synthetic data using SMOTE or ADASYN. This usually works. I see you have already considered this as an option to explore.
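To make the idea behind SMOTE concrete, here is a small pure-Python sketch of its core step: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point on the line segment between them. This is an illustration only, not the imblearn implementation; the function name smote_sketch, its parameters, and the 2-D toy points are all hypothetical. For text, the inputs would have to be numeric vectors such as embeddings, which is where the memory concern quoted in the question comes from.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (e.g. tiny 2-D embeddings).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=6)
print(new_points)
```

In practice the ready-made imblearn.over_sampling.SMOTE mentioned in the question is the more sensible choice; the sketch only shows what "synthetic" means here.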

You could change the evaluation metric to fbeta_scorer (it is a weighted F-score).
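For reference, the F-beta score combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favours recall and beta < 1 favours precision. The helper below is a hand-written sketch of that formula (the function name fbeta is chosen here for illustration), not a library call.

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta score computed from precision and recall; returns 0.0 when both are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# With beta=2, recall counts more, so high recall lifts the score above F1.
f1 = fbeta(0.5, 1.0, beta=1.0)
f2 = fbeta(0.5, 1.0, beta=2.0)
print(f1, f2)
```

For a rare positive class where missing a "Yes" document is costly, a beta above 1 is the usual direction to lean.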

Or, if the dataset is large enough, you can try undersampling the majority class.
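Random undersampling can be sketched as follows: group the examples by label, keep as many examples from every class as the smallest class has, and shuffle the result. The function name undersample and the toy documents are hypothetical, and the sketch assumes the data fits in plain Python lists before being converted to a tf.data.Dataset.

```python
import random


def undersample(examples, labels, seed=0):
    """Randomly drop examples so every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_keep = min(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        pairs.extend((x, y) for x in rng.sample(xs, n_keep))
    rng.shuffle(pairs)  # avoid feeding the model one class after the other
    return [x for x, _ in pairs], [y for _, y in pairs]


# Hypothetical toy corpus: 3 positive and 87 negative documents.
docs = [f"doc{i}" for i in range(90)]
doc_labels = [1] * 3 + [0] * 87
balanced_docs, balanced_labels = undersample(docs, doc_labels)
print(len(balanced_docs), balanced_labels)
```

The cost is visible in the toy numbers: 84 of the 87 negative documents are discarded, which is why this only makes sense when the dataset is large.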
