
Machine Learning with an Unbalanced Dataset

I've got the following unbalanced dataset containing wine ratings from 1-10. The class balance is the following:

Rating / # Samples (%)

  • 1 - 0 (0.0%)
  • 2 - 0 (0.0%)
  • 3 - 10 (0.74%)
  • 4 - 53 (3.90%)
  • 5 - 577 (42.5%)
  • 6 - 535 (39.40%)
  • 7 - 167 (12.29%)
  • 8 - 17 (1.25%)
  • 9 - 0 (0.0%)
  • 10 - 0 (0.0%)

Since I can't get more data, what is the best possible way of predicting ratings using scikit-learn with this unbalanced data? Can SMOTE be applied in this case?

Thanks all!

Predicting data values

Studying the dataset

We begin by studying the shape of the dataset, since it may reveal a particular distribution:

import numpy as np
from matplotlib import pyplot as plt

# (rating, percentage of samples) pairs from the question
data = np.array([(1, 0), (2, 0), (3, 0.74), (4, 3.90), (5, 42.5), (6, 39.40),
                 (7, 12.29), (8, 1.25), (9, 0), (10, 0)])

x = data[:, 0]
y = data[:, 1] / 100  # normalise the percentages to fractions

plt.title("Wine ratings percentages")
plt.ylabel("Fraction of samples")
plt.xlabel("Rating")
plt.plot(x, y, '.')
plt.plot(x, y)
plt.show()

The results are in:

[Plot: wine ratings percentages]

Results interpretation

The distribution of the data, as could be expected from a 1-10 ratings dataset, looks Binomial: a discrete distribution that, for parameters like these, resembles a discretised Gaussian. What we have plotted is the empirical distribution of the ratings.
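To see why the "discretised Gaussian" remark is reasonable, one can compare a Binomial pmf with the matching Normal density; this quick check is not part of the original answer, and the parameters below are illustrative (p is estimated in the next section):

from scipy.stats import binom, norm
import numpy as np

n, p = 10, 0.55
mu, sigma = n * p, np.sqrt(n * p * (1 - p))  # Normal approximation parameters

for k in range(n + 1):
    # Binomial mass vs. Normal density at the same point
    print(k, round(binom.pmf(k, n, p), 4), round(norm.pdf(k, mu, sigma), 4))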

Predicting the values

Now that we have an idea of the distribution of our dataset, we have to predict the value of every class, under the assumption that the distribution is truly Binomial.

A Binomial distribution has two parameters: the number of trials n, in this scenario 10, and the success probability of a single trial, usually called p. Since the mean of a Binomial is np, we can easily obtain p = mean/n.

mean = np.mean(x)  # plain mean of the rating values 1..10, i.e. 5.5
p = mean / 10

The two values are n=10 and p = mean/10 = 0.55. (Note that np.mean(x) is the plain mean of the rating values 1-10; a mean weighted by the observed frequencies, np.sum(x * y), would come out near 5.6 rather than 5.5.) We can use these as parameters to obtain the distribution these data would follow if the dataset were complete.

from scipy.stats import binom

my_binom = binom(10, p)     # Binomial with n=10 trials and probability p
x_b = np.arange(0, 10 + 1)  # possible outcomes 0..10
y_b = my_binom.pmf(x_b)     # probability mass of each outcome

plt.plot(x_b, y_b, '.')
plt.plot(x_b, y_b)
plt.show()

[Plot: Binomial distribution]

The predicted values

With this approach, the obtained values are the following:

predictions = [(0, 0.0003405062891601558), (1, 0.004161743534179685),
               (2, 0.02288958943798826), (3, 0.07460310631640629),
               (4, 0.15956775517675784), (5, 0.2340327075925782),
               (6, 0.2383666466220704), (7, 0.1664782928789064),
               (8, 0.07630255090283203), (9, 0.020724149627929712),
               (10, 0.0025329516211914063)]

[Table: predicted values]
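For reference, this table is just the pairing of the two arrays computed above, i.e. something like:

predictions = [(int(k), float(v)) for k, v in zip(x_b, y_b)]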

Further notes

You can take this approach further, trying to identify a more specific distribution or, if you have more data about other aspects of your model, applying Bayes' theorem to better fit the desired predictions.

As Vivek mentioned in his comment, you cannot do anything about the classes for which you have no data. As far as the remaining classes are concerned, some of them have too few samples. You could try class weights (available in sklearn) or under-sampling, but I doubt they would work well; a sketch of both options is shown below.
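Here is a minimal sketch of both options, including SMOTE since the question asks about it. This is illustrative rather than part of the original answer: make_classification merely stands in for the real wine features, and SMOTE comes from the separate third-party imbalanced-learn package (pip install imbalanced-learn):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with a similar imbalance (replace with the wine features).
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=6,
                           weights=[0.45, 0.40, 0.12, 0.03], random_state=0)

# Option 1: class weights -- the loss penalises rare-class mistakes more.
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X, y)

# Option 2: SMOTE synthesises minority samples by interpolating between
# neighbours. With only ~10 real samples in the rarest wine class,
# k_neighbors must stay below that count, and the synthetic points are
# unlikely to add much real information.
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)
clf.fit(X_res, y_res)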

Spending time obtaining more data for those classes is a good idea. If that is not possible, maybe have two classifiers: one for the low-count classes and another for the other classes. You can use a third classifier to split a given instance into either of these two groups (basically a hierarchical classifier); see the sketch after this paragraph.
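A minimal sketch of that hierarchical idea, assuming a feature matrix X and integer ratings y (placeholder names). The rare/frequent split {3, 4, 8} vs {5, 6, 7} is taken from the question's counts, but where to draw the line is a judgment call:

import numpy as np
from sklearn.linear_model import LogisticRegression

RARE = [3, 4, 8]  # low-count ratings in the question's data

def fit_hierarchical(X, y):
    """Fit a router plus one classifier per branch."""
    is_rare = np.isin(y, RARE)
    router = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, is_rare)
    rare_clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X[is_rare], y[is_rare])
    freq_clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X[~is_rare], y[~is_rare])
    return router, rare_clf, freq_clf

def predict_hierarchical(X, router, rare_clf, freq_clf):
    """Route each instance to a branch, then predict its rating there."""
    route = router.predict(X).astype(bool)  # True -> rare branch
    preds = np.empty(len(X), dtype=int)
    if route.any():
        preds[route] = rare_clf.predict(X[route])
    if (~route).any():
        preds[~route] = freq_clf.predict(X[~route])
    return preds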
