I've got the following unbalanced dataset containing wine ratings from 1-10. The class balance is the following:
Rating / # Samples (%)
1 / 0
2 / 0
3 / 0.74
4 / 3.90
5 / 42.5
6 / 39.40
7 / 12.29
8 / 1.25
9 / 0
10 / 0
Since I can't get more data, what is the best way to predict ratings using scikit-learn with this unbalanced data? Can SMOTE be applied in this case?
Thanks all!
We begin by studying the shape of the dataset; it may reveal a particular distribution:
import numpy as np
from matplotlib import pyplot as plt

# (rating, percentage of samples) pairs from the question
data = np.array([(1, 0), (2, 0), (3, 0.74), (4, 3.90), (5, 42.5), (6, 39.40),
                 (7, 12.29), (8, 1.25), (9, 0), (10, 0)])
x = data[:, 0]
y = data[:, 1] / 100  # normalise the percentages to fractions

plt.title("Wine ratings percentages")
plt.ylabel("Fraction of samples")
plt.xlabel("Rating")
plt.plot(x, y, '.')
plt.plot(x, y)
plt.show()
The resulting plot shows a single peak around ratings 5 and 6.
The distribution of the data, as could be expected from a 1-10 ratings dataset, looks roughly Binomial, i.e. bell-shaped like a discretised Gaussian.
Now that we have an idea of the distribution of our dataset, we can predict the probability of every class, under the assumption that the data are truly Binomial.
A Binomial distribution has two parameters: the number of trials n, in this scenario 10, and the success probability of a single trial, usually called p. Since the mean of a Binomial is np, we can estimate p = mean/n.
mean = np.mean(x)  # unweighted mean of the rating values 1..10, i.e. 5.5
# (a frequency-weighted mean, np.sum(x * y) ~= 5.63, would track the data more closely)
p = mean / 10
The two values are n=10 and p = mean/10 = 0.55. We can use these as parameters to obtain the distribution these data would follow if they were a complete Binomial sample.
from scipy.stats import binom

my_binom = binom(10, p)       # fitted Binomial(n=10, p=0.55)
x_b = np.arange(0, 10 + 1)    # support is 0..10
y_b = my_binom.pmf(x_b)
plt.plot(x_b, y_b, '.')
plt.plot(x_b, y_b)
plt.show()
With this approach, the obtained values are the following:
predictions = [(0, 0.0003405062891601558), (1, 0.004161743534179685),
(2, 0.02288958943798826), (3, 0.07460310631640629),
(4, 0.15956775517675784), (5, 0.2340327075925782),
(6, 0.2383666466220704), (7, 0.1664782928789064),
(8, 0.07630255090283203), (9, 0.020724149627929712),
(10, 0.0025329516211914063)]
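To judge how well this fit matches the original data, one quick sanity check is to compare the fitted pmf against the observed frequencies directly. This is a minimal sketch using the percentages from the question and p = 0.55 as estimated above; `total_abs_error` is just an illustrative goodness-of-fit measure, not a formal test.

```python
import numpy as np
from scipy.stats import binom

# Observed percentages for ratings 1-10, normalised to fractions
observed = np.array([0, 0, 0.74, 3.90, 42.5, 39.40, 12.29, 1.25, 0, 0]) / 100

# Fitted Binomial(10, 0.55) evaluated at k = 1..10 to line up with the ratings
fitted = binom(10, 0.55).pmf(np.arange(1, 11))

# Total absolute deviation between observed and fitted probabilities
total_abs_error = np.abs(observed - fitted).sum()
```

A large deviation would suggest trying a different distribution (or a formal test such as a chi-square goodness-of-fit test) before trusting the predicted class probabilities.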
You can take this approach and explore it further, trying to identify a more specific distribution or, if you have more data related to other aspects of your model, applying Bayes' theorem to better fit the desired predictions.
As Vivek mentioned in his comment, you cannot do anything about the classes for which you have no data. As far as the remaining classes are concerned, some of them have too few samples. You could try class weights (available in scikit-learn) or under-sampling, but I doubt they would work well.
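For reference, this is a minimal sketch of the class-weighting idea in scikit-learn. The feature matrix `X` and labels `y` here are synthetic stand-ins (the real wine features are not in the question); only the `class_weight="balanced"` usage is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic stand-in data drawn with roughly the observed class imbalance
ratings = np.array([3, 4, 5, 6, 7, 8])
probs = np.array([0.74, 3.90, 42.5, 39.40, 12.29, 1.25])
y = rng.choice(ratings, size=200, p=probs / probs.sum())
X = rng.normal(size=(200, 3)) + y[:, None]  # features loosely correlated with rating

# 'balanced' reweights each class inversely to its frequency in y
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# The same weights can be inspected explicitly
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
```

With `class_weight="balanced"`, misclassifying a rare rating costs more during training, which partially compensates for the imbalance without resampling the data.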
Spending time obtaining more data for those classes is a good idea. If that is not possible, consider having two classifiers: one for the low-count classes and another for the remaining classes. You can use a third classifier to route a given instance to either of these two groups (basically a hierarchical classifier).
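The hierarchical idea above can be sketched as follows. A "router" first decides whether a sample belongs to a rare rating or a common one, then a specialist classifier for that group predicts the final rating. Again, `X` and `y` are synthetic stand-ins, and the rare/common split (3, 4, 8 vs 5, 6, 7) is just an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in data with roughly the observed imbalance
ratings = np.array([3, 4, 5, 6, 7, 8])
probs = np.array([0.74, 3.90, 42.5, 39.40, 12.29, 1.25])
y = rng.choice(ratings, size=500, p=probs / probs.sum())
X = rng.normal(size=(500, 3)) + y[:, None]

rare = np.isin(y, [3, 4, 8])  # rare vs common split (illustrative)

# Router: rare vs common; specialists: one classifier per group
router = RandomForestClassifier(random_state=0).fit(X, rare)
clf_rare = RandomForestClassifier(random_state=0).fit(X[rare], y[rare])
clf_common = RandomForestClassifier(random_state=0).fit(X[~rare], y[~rare])

def predict(X_new):
    # Route each sample to the rare- or common-class specialist
    is_rare = router.predict(X_new).astype(bool)
    out = np.empty(len(X_new), dtype=int)
    if is_rare.any():
        out[is_rare] = clf_rare.predict(X_new[is_rare])
    if (~is_rare).any():
        out[~is_rare] = clf_common.predict(X_new[~is_rare])
    return out
```

One caveat: errors made by the router propagate to the specialists, so this only helps when the rare/common split is easier to learn than the full 10-class problem.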