
Multi-class classification with Keras

Say I have 6 different categories I'm trying to classify my data into using a NN. How important is it for training that I have an equal number of instances for each class? At present I have about 50k for one class, 6k for another, 300 for another; you get the picture. How big of a problem is this? I'm thinking I might drop some of the classes with low representation, but I'm not sure what a good cutoff would be, or whether it really matters.

Imbalanced data is generally a problem for machine learning, particularly when the classes are severely imbalanced (as in your case). In a nutshell, the algorithm won't be able to learn the right associations between the features and the categories for all classes. It will most likely miss the rules for the minority classes and/or rely too much on the majority class(es). Have a look at the imblearn package. General solutions for imbalanced data are to either:

  1. Downsample the majority class (reduce the number of samples/instances in the majority class to match one of the minority classes).
  2. Upsample the minority classes, for example with SMOTE (Synthetic Minority Over-sampling Technique). This increases the number of samples in the minority classes to match some target number (e.g. the size of the majority class).
  3. A combination of both.
  4. Drop classes with very low representation (not the best idea, but justifiable in some cases). 300 might still be usable if you upsample, but it probably isn't ideal.
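
As a rough illustration of options 1 to 3, here is a minimal sketch in plain Python that resamples every class to a common size. The function name `resample_to_target`, the toy labels, and the target of 60 are all made up for the example. Note that the upsampling here is simple random duplication, not SMOTE (which synthesizes new points); imblearn provides proper implementations of both.

```python
import random
from collections import Counter, defaultdict


def resample_to_target(samples, labels, target, seed=0):
    """Randomly under- or oversample every class to exactly `target` items.

    Hypothetical helper for illustration only; not part of imblearn or Keras.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_samples, out_labels = [], []
    for y, items in by_class.items():
        if len(items) >= target:
            # Downsample: draw `target` items without replacement.
            chosen = rng.sample(items, target)
        else:
            # Upsample: keep all originals and add random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([y] * target)
    return out_samples, out_labels


# Toy data loosely mirroring the question's imbalance at 1/100 scale (assumed).
labels = ["a"] * 500 + ["b"] * 60 + ["c"] * 3
samples = list(range(len(labels)))  # stand-ins for feature vectors

new_samples, new_labels = resample_to_target(samples, labels, target=60)
print(Counter(new_labels))  # every class now has 60 entries
```

If you try something like this, resample only the training split; the validation and test data should keep the original class distribution so the reported scores stay honest.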

Another consideration is to change your performance metric to per-class precision/recall rather than accuracy alone, since accuracy can look high even when a minority class is never predicted.
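
To see why the choice of metric matters, here is a small sketch with made-up labels: a model that always predicts the majority class. The helper `per_class_precision_recall` is hypothetical, and returning 0.0 when a class is never predicted is an assumed convention for this example.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)}; hypothetical helper for illustration."""
    true_count = Counter(y_true)   # how often each class actually occurs
    pred_count = Counter(y_pred)   # how often each class is predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        # Assumed convention: an undefined ratio (zero denominator) counts as 0.0.
        precision = true_pos[c] / pred_count[c] if pred_count[c] else 0.0
        recall = true_pos[c] / true_count[c] if true_count[c] else 0.0
        report[c] = (precision, recall)
    return report


# Made-up example: 95 majority samples, 5 minority samples,
# and a "model" that always answers with the majority class.
y_true = ["a"] * 95 + ["b"] * 5
y_pred = ["a"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_precision_recall(y_true, y_pred)

print(accuracy)  # 0.95
print(report)    # {'a': (0.95, 1.0), 'b': (0.0, 0.0)}
```

The accuracy of 0.95 looks fine, while the recall of 0.0 for class `b` shows that the minority class is missed entirely.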

This link should provide some further examples that might be helpful
