
Machine Learning with an Unbalanced Dataset

I've got the following unbalanced dataset containing wine ratings from 1-10. The class balance is the following:

Rating / # Samples (%)

  • 1 - 0 (0.0%)
  • 2 - 0 (0.0%)
  • 3 - 10 (0.74%)
  • 4 - 53 (3.90%)
  • 5 - 577 (42.5%)
  • 6 - 535 (39.40%)
  • 7 - 167 (12.29%)
  • 8 - 17 (1.25%)
  • 9 - 0 (0.0%)
  • 10 - 0 (0.0%)

Since I can't get more data, what is the best possible way of predicting ratings using scikit-learn with this unbalanced data? Can SMOTE be applied in this case?

Thanks all!

Predicting data values

Studying the dataset

We begin by studying the shape of the dataset, since it may reveal a particular distribution:

import numpy as np
from matplotlib import pyplot as plt

# (rating, percentage of samples) pairs from the question
data = np.array([(1, 0), (2, 0), (3, 0.74), (4, 3.90), (5, 42.5), (6, 39.40),
                 (7, 12.29), (8, 1.25), (9, 0), (10, 0)])

x = data[:, 0]
y = data[:, 1] / 100  # normalise the percentages to fractions

plt.title("Wine ratings percentages")
plt.ylabel("Fraction of samples")
plt.xlabel("Rating")
plt.plot(x, y, '.')
plt.plot(x, y)
plt.show()

The results are in:

[Plot: wine ratings percentages]

Results interpretation

The distribution of the data, as could be expected from a 1-10 ratings dataset, looks Binomial: a discrete distribution that, for parameters like these, resembles a discretised Gaussian. What we have plotted is the empirical distribution of the ratings.
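To see why the "discretised Gaussian" remark is reasonable, one can compare a Binomial pmf with the matching Normal density; this quick check is not part of the original answer, and the parameters below are illustrative (p is estimated in the next section):

from scipy.stats import binom, norm
import numpy as np

n, p = 10, 0.55
mu, sigma = n * p, np.sqrt(n * p * (1 - p))  # Normal approximation parameters

for k in range(n + 1):
    # Binomial mass vs. Normal density at the same point
    print(k, round(binom.pmf(k, n, p), 4), round(norm.pdf(k, mu, sigma), 4))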

Predicting the values

Now that we have an idea of the distribution of our dataset, we have to predict the value of every class, under the assumption that the distribution is truly Binomial.

A Binomial distribution has two parameters: the number of trials n, in this scenario 10, and the success probability of a single trial, usually called p. Since the mean of a Binomial is np, we can easily obtain p = mean/n.

mean = np.mean(x)  # plain mean of the rating values 1..10, i.e. 5.5
p = mean / 10

The two values are n=10 and p = mean/10 = 0.55. (Note that np.mean(x) is the plain mean of the rating values 1-10; a mean weighted by the observed frequencies, np.sum(x * y), would come out near 5.6 rather than 5.5.) We can use these as parameters to obtain the distribution these data would follow if the dataset were complete.

from scipy.stats import binom

my_binom = binom(10, p)     # Binomial with n=10 trials and probability p
x_b = np.arange(0, 10 + 1)  # possible outcomes 0..10
y_b = my_binom.pmf(x_b)     # probability mass of each outcome

plt.plot(x_b, y_b, '.')
plt.plot(x_b, y_b)
plt.show()

[Plot: Binomial distribution]

The predicted values

With this approach, the obtained values are the following:

predictions = [(0, 0.0003405062891601558), (1, 0.004161743534179685),
               (2, 0.02288958943798826), (3, 0.07460310631640629),
               (4, 0.15956775517675784), (5, 0.2340327075925782),
               (6, 0.2383666466220704), (7, 0.1664782928789064),
               (8, 0.07630255090283203), (9, 0.020724149627929712),
               (10, 0.0025329516211914063)]

[Table: predicted values]
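For reference, this table is just the pairing of the two arrays computed above, i.e. something like:

predictions = [(int(k), float(v)) for k, v in zip(x_b, y_b)]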

Further notes

You can take this approach further, trying to identify a more specific distribution or, if you have more data about other aspects of your model, applying Bayes' theorem to better fit the desired predictions.

As Vivek mentioned in his comment, you cannot do anything about the classes for which you have no data. As far as the remaining classes are concerned, some of them have too few samples. You could try class weights (available in sklearn) or under-sampling, but I doubt they would work well; a sketch of both options is shown below.
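Here is a minimal sketch of both options, including SMOTE since the question asks about it. This is illustrative rather than part of the original answer: make_classification merely stands in for the real wine features, and SMOTE comes from the separate third-party imbalanced-learn package (pip install imbalanced-learn):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with a similar imbalance (replace with the wine features).
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=6,
                           weights=[0.45, 0.40, 0.12, 0.03], random_state=0)

# Option 1: class weights -- the loss penalises rare-class mistakes more.
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X, y)

# Option 2: SMOTE synthesises minority samples by interpolating between
# neighbours. With only ~10 real samples in the rarest wine class,
# k_neighbors must stay below that count, and the synthetic points are
# unlikely to add much real information.
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)
clf.fit(X_res, y_res)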

Spending time obtaining more data for those classes is a good idea. If that is not possible, maybe have two classifiers: one for the low-count classes and another for the other classes. You can use a third classifier to split a given instance into either of these two groups (basically a hierarchical classifier); see the sketch after this paragraph.
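A minimal sketch of that hierarchical idea, assuming a feature matrix X and integer ratings y (placeholder names). The rare/frequent split {3, 4, 8} vs {5, 6, 7} is taken from the question's counts, but where to draw the line is a judgment call:

import numpy as np
from sklearn.linear_model import LogisticRegression

RARE = [3, 4, 8]  # low-count ratings in the question's data

def fit_hierarchical(X, y):
    """Fit a router plus one classifier per branch."""
    is_rare = np.isin(y, RARE)
    router = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, is_rare)
    rare_clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X[is_rare], y[is_rare])
    freq_clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X[~is_rare], y[~is_rare])
    return router, rare_clf, freq_clf

def predict_hierarchical(X, router, rare_clf, freq_clf):
    """Route each instance to a branch, then predict its rating there."""
    route = router.predict(X).astype(bool)  # True -> rare branch
    preds = np.empty(len(X), dtype=int)
    if route.any():
        preds[route] = rare_clf.predict(X[route])
    if (~route).any():
        preds[~route] = freq_clf.predict(X[~route])
    return preds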
