
Machine learning: Classification on imbalanced data

I am solving a classification problem using Python's sklearn + xgboost modules. I have highly imbalanced data, with ~92% class 0 and only 8% class 1. The training data set can be downloaded here: http://www.filedropper.com/kangarootrain

I can't use the numclaims and claimcst0 variables in this dataset. The variables in this dataset are: id,claimcst0,veh_value,exposure,veh_body,veh_age,gender,area,agecat,clm,numclaims

gender, area, and agecat are categorical variables and the rest are continuous. id is the id for that record.

The first few records are:

id,claimcst0,veh_value,exposure,veh_body,veh_age,gender,area,agecat,clm,numclaims
1,0,6.43,0.241897754,STNWG,1,M,A,3,0,0
2,0,4.46,0.856522757,STNWG,1,M,A,3,0,0
3,0,1.7,0.417516596,HBACK,1,M,A,4,0,0
4,0,0.48,0.626974524,SEDAN,4,F,A,6,0,0
5,0,1.96,0.089770031,HBACK,1,F,A,2,0,0
6,0,1.78,0.25654335,HBACK,2,M,A,3,0,0
7,0,2.7,0.688128611,UTE,2,M,A,1,0,0
8,0,0.94,0.912765859,STNWG,4,M,A,2,0,0
9,0,1.98,0.157753423,SEDAN,2,M,A,4,0,0

I tried several methods to predict 'clm', which is my target variable: KNN, RF, SVM, and NB. I even tried subsampling the data, but nothing I do makes the predictions better. With trees/boosting I am getting ~93% accuracy, but only because I am predicting all the 0s correctly.

The model is incorrectly predicting all the 1s as 0s too.

Any help would be really helpful. This is the basic code I tried for NB:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Fit a Gaussian naive Bayes classifier and evaluate on the test set.
clfnb = GaussianNB()
clfnb.fit(x_train, y_train)
pred = clfnb.predict(x_test)
# print(set(pred))
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))

0.92816091954
[[8398    0]
[ 650    0]]
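
Per-class metrics make this failure mode visible much more directly than accuracy; a minimal sketch using sklearn's classification_report on the same predictions:

from sklearn.metrics import classification_report

# Precision/recall/F1 per class: recall for class 1 is 0.0 here,
# even though overall accuracy looks like ~93%.
print(classification_report(y_test, pred, digits=3))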

This is a pretty common challenge: your two classes are not balanced. To stop the model from predicting only one class well, you have to use a balanced training set. There are several solutions; the most basic is to sample your data evenly. Since you have about 1500 samples of class 1, you should also take 1500 samples of class 0.

import pandas as pd

# Undersample the majority class so both classes contribute n rows each.
n = 1500
sample_yes = data[data.clm == 1].sample(n=n, replace=False, random_state=0)
sample_no = data[data.clm == 0].sample(n=n, replace=False, random_state=0)
df = pd.concat([sample_yes, sample_no])

where data is the original dataframe. You should do this before you split your data into training and test sets, as sketched below.
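
A minimal sketch of the subsequent split (assuming clm is the target and the remaining feature columns are already numerically encoded):

from sklearn.model_selection import train_test_split

# Shuffle the balanced frame, separate features from the target, and split.
df = df.sample(frac=1, random_state=0)
X = df.drop(columns=['id', 'claimcst0', 'numclaims', 'clm'])
y = df['clm']
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)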

You can also set the class_weight parameter to compensate for the imbalanced dataset. For example, in this case, since label 1 has only 8% of the data, you give that label a higher weight when fitting the classifier. From the sklearn SVC documentation:

class_weight : dict or 'balanced', optional
    Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
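
For example, with the roughly 92/8 split here, a minimal sketch using SVC (the explicit 11.5 weight is just 92/8, an illustrative value; 'balanced' derives an equivalent automatically):

from sklearn.svm import SVC

# Let sklearn derive class weights from class frequencies...
clf = SVC(class_weight='balanced')
# ...or set them by hand, upweighting class 1 by roughly 92/8:
# clf = SVC(class_weight={0: 1, 1: 11.5})
clf.fit(x_train, y_train)
pred = clf.predict(x_test)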

For an imbalanced dataset, I used per-sample weights with Xgboost, where each sample's weight is assigned according to the class it belongs to:

import numpy as np

def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
    # Inverse-frequency weight per class: n_samples / (n_classes * class_count).
    classes = np.unique(y_train, axis=0)
    classes.sort()
    class_samples = np.bincount(y_train)
    total_samples = class_samples.sum()
    n_classes = len(class_samples)
    weights = total_samples / (n_classes * class_samples * 1.0)
    class_weight_dict = {key: value for (key, value) in zip(classes, weights)}
    class_weight_dict[classes[1]] = (class_weight_dict[classes[1]]
                                     * largest_class_weight_coef)
    sample_weights = [class_weight_dict[y] for y in y_train]
    return sample_weights

Just pass the target column and the occurrence rate of the most frequent class (if the most frequent class has 75 out of 100 samples, then it's 0.75):

largest_class_weight_coef = max(df['Category'].value_counts().values) / df.shape[0]

from xgboost import XGBClassifier

# Pass y_train as a numpy array.
weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)

# Then pass the weights to fit() via sample_weight; XGBClassifier has no
# 'weights' constructor argument.
xg = XGBClassifier(n_estimators=1000, max_depth=20)
xg.fit(x_train, y_train, sample_weight=weight)

That's it :) Now your model will give more weight to the less frequent class.
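
As a side note, xgboost also has a built-in scale_pos_weight parameter for binary problems, conventionally set to the ratio of negative to positive counts; a minimal sketch of that alternative (assuming the same x_train/y_train):

from xgboost import XGBClassifier

# Upweight the positive class by (#negatives / #positives), e.g. roughly 92/8 here.
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()
xg = XGBClassifier(n_estimators=1000, max_depth=20, scale_pos_weight=ratio)
xg.fit(x_train, y_train)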
