Fixing the 100% accuracy with DecisionTreeClassifier in scikit-learn

Question

I am trying to use a decision tree for classification, and I get 100% accuracy.

This is a common problem, described here and here, and in many other questions.

The data is here.

My two best guesses:

I split the data incorrectly
My dataset is too imbalanced

What is wrong with my code?

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

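# Separate the target from the features
# Target: the column at position 4 (expected to be offer_completed)
Y = starbucks.iloc[:, 4]
# Features: every column except offer_completed
X = starbucks.loc[:, starbucks.columns != 'offer_completed']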

# Splitting the dataset into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                                                    test_size=0.3,
                                                    random_state=100) 

# Creating the classifier object 
clf_gini = DecisionTreeClassifier(criterion = "gini", 
                                  random_state = 100, 
                                  max_depth = 3, 
                                  min_samples_leaf = 5) 

# Performing training 
clf_gini.fit(X_train, y_train)

# Prediction on the test set with the Gini-criterion tree
y_pred = clf_gini.predict(X_test)
print("Predicted values:")
print(y_pred)

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))

print("Accuracy : ", accuracy_score(y_test, y_pred)*100)

print("Report : ", classification_report(y_test, y_pred))


Predicted values:
[0. 0. 0. ... 0. 0. 0.]
Confusion Matrix:  [[36095     0]
                    [    0  8158]]
Accuracy :  100.0

When I print X, it shows me that offer_completed was removed.

X.dtypes

offer_received               int64
offer_viewed               float64
time_viewed_received       float64
time_completed_received    float64
time_completed_viewed      float64
transaction                float64
amount                     float64
total_reward               float64
age                        float64
income                     float64
male                         int64
membership_days            float64
reward_each_time           float64
difficulty                 float64
duration                   float64
email                      float64
mobile                     float64
social                     float64
web                        float64
bogo                       float64
discount                   float64
informational              float64

Answer 1

After fitting the model and checking the feature importances, you can see that they are all zero except for total_reward. Investigating that column gives:

df.groupby(target)['total_reward'].describe()
    count   mean    std    min   25%    50%   75%    max
0   119995  0.0     0.0    0.0   0.0    0.0   0.0    0.0
1   27513   5.74    4.07   2.0   3.0    5.0   10.0   40.0

You can see that for target 0, total_reward is always zero, while for target 1 it is always greater than zero (the minimum is 2.0). A single threshold on this column separates the two classes perfectly. Here's your leak.
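The feature importance check described in this answer can be sketched as follows. This is a minimal example on synthetic data, not the Starbucks dataset: the column names `noise` and `leak` and the helper `importance_table` are made up for illustration, and `leak` is constructed to mimic how total_reward behaves (zero for class 0, positive for class 1).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def importance_table(clf, columns):
    """Return feature importances as a Series, largest first (hypothetical helper)."""
    return pd.Series(clf.feature_importances_, index=columns).sort_values(ascending=False)


# Synthetic stand-in for the real data (assumption: not the original dataset)
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = pd.DataFrame({
    "noise": rng.normal(size=1000),
    # Zero for class 0, strictly positive for class 1, like total_reward
    "leak": np.where(y == 1, rng.uniform(2.0, 10.0, size=1000), 0.0),
})

clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=100)
clf.fit(X, y)

# A single column carrying all of the importance is a strong hint of leakage
importances = importance_table(clf, X.columns)
print(importances)
```

If one column holds essentially all of the importance, as `leak` does here, it is worth checking whether that column is only known after the target event has happened.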

Since there could be other leaks and checking each column by hand is tedious, we can measure a rough "predictive power" for each feature on its own by training a tree on that single column:

acc_df = pd.DataFrame(columns=['col', 'acc'], index=range(len(X.columns)))

for i, c in enumerate(X.columns):

    clf = DecisionTreeClassifier(criterion = "gini", 
                                 random_state = 100, 
                                 max_depth = 3, 
                                 min_samples_leaf = 5) 
    
    clf.fit(X_train[c].to_numpy()[:, None], y_train)
    
    y_pred = clf.predict(X_test[c].to_numpy()[:, None])
    acc_df.iloc[i] = [c, accuracy_score(y_test, y_pred)*100]


acc_df.sort_values('acc',ascending=False)
                 col      acc
8       total_reward      100
4     completed_time  99.8848
13  reward_each_time  89.3205
14        difficulty  89.3205
15          duration  89.3205
21          discount  86.4054
19               web   85.088
20              bogo  84.4801
3        viewed_time  84.4056
2       offer_viewed  84.3491
18            social  83.3525
1      received_time  83.0497
7             amount  82.5436
0     offer_received  81.7526
16             email  81.7526
17            mobile  81.6464
11              male  81.5651
10            income  81.5651
9                age  81.5651
6   transaction_time  81.5651
5        transaction  81.5651
22     informational  81.5651
12   membership_days  81.5561
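When reading the table above, it helps to have a baseline. The score of 81.5651% shared by several features appears to match the share of class 0 in the question's test confusion matrix (36095 of 44253 rows), so those columns do no better than always predicting the majority class. The sketch below shows one way to compute such a baseline and to retrain after dropping a suspected leak. It uses synthetic data, and the column names `noise`, `weak_signal`, `leak` and the helper `tree_accuracy` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data (roughly 20% positives)
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.2).astype(int)
X = pd.DataFrame({
    "noise": rng.normal(size=n),
    "weak_signal": y + rng.normal(scale=1.0, size=n),
    # Zero for class 0, strictly positive for class 1, like total_reward
    "leak": np.where(y == 1, rng.uniform(2.0, 10.0, size=n), 0.0),
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100, stratify=y
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))


def tree_accuracy(columns):
    """Fit the same small tree on the given columns and return test accuracy."""
    clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=100)
    clf.fit(X_train[columns], y_train)
    return accuracy_score(y_test, clf.predict(X_test[columns]))


acc_with_leak = tree_accuracy(["noise", "weak_signal", "leak"])
acc_without_leak = tree_accuracy(["noise", "weak_signal"])

print("baseline:", baseline_acc)
print("with leak:", acc_with_leak)
print("without leak:", acc_without_leak)
```

A feature whose single-column accuracy sits at the baseline carries little usable signal on its own, while one that reaches 100% (or close to it, like completed_time in the table) deserves the same scrutiny as total_reward before it is kept in the model.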
