I am trying to use decision tree for classification and get 100% accuracy.
It is a common problem, described here and here . And in many other questions.
Data is here .
Two best guesses:
What is wrong with my code?
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import sklearn.model_selection as cv
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Split data
Y = starbucks.iloc[:, 4]
X = starbucks.loc[:, starbucks.columns != 'offer_completed']
# Splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y,
test_size=0.3,
random_state=100)
# Creating the classifier object
clf_gini = DecisionTreeClassifier(criterion = "gini",
random_state = 100,
max_depth = 3,
min_samples_leaf = 5)
# Performing training
clf_gini.fit(X_train, y_train)
# Predicton on test with giniIndex
y_pred = clf_gini.predict(X_test)
print("Predicted values:")
print(y_pred)
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ", accuracy_score(y_test, y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)
Predicted values:
[0. 0. 0. ... 0. 0. 0.]
Confusion Matrix: [[36095 0]
[ 0 8158]]
Accuracy : 100.0
When I print X, it shows me that offer_completed
was removed.
X.dtypes
offer_received int64
offer_viewed float64
time_viewed_received float64
time_completed_received float64
time_completed_viewed float64
transaction float64
amount float64
total_reward float64
age float64
income float64
male int64
membership_days float64
reward_each_time float64
difficulty float64
duration float64
email float64
mobile float64
social float64
web float64
bogo float64
discount float64
informational float64
Fitting the model and checking feature importances you can see that they are all zeros except for total_reward
. Then investingating such column you get:
df.groupby(target)['total_reward'].describe()
count mean std min 25% 50% 75% max
0 119995 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 27513 5.74 4.07 2.0 3.0 5.0 10.0 40.0
You can see that for target 0, total_reward
is always zero, otherwise has a value always greater than 0. Here's your leak.
As there could be other leaks and it is tedious to check each column, we can use a sort of "predictive power" of each feature alone:
acc_df = pd.DataFrame(columns=['col', 'acc'], index=range(len(X.columns)))
for i, c in enumerate(X.columns):
clf = DecisionTreeClassifier(criterion = "gini",
random_state = 100,
max_depth = 3,
min_samples_leaf = 5)
clf.fit(X_train[c].to_numpy()[:, None], y_train)
y_pred = clf.predict(X_test[c].to_numpy()[:, None])
acc_df.iloc[i] = [c, accuracy_score(y_test, y_pred)*100]
acc_df.sort_values('acc',ascending=False)
col acc
8 total_reward 100
4 completed_time 99.8848
13 reward_each_time 89.3205
14 difficulty 89.3205
15 duration 89.3205
21 discount 86.4054
19 web 85.088
20 bogo 84.4801
3 viewed_time 84.4056
2 offer_viewed 84.3491
18 social 83.3525
1 received_time 83.0497
7 amount 82.5436
0 offer_received 81.7526
16 email 81.7526
17 mobile 81.6464
11 male 81.5651
10 income 81.5651
9 age 81.5651
6 transaction_time 81.5651
5 transaction 81.5651
22 informational 81.5651
12 membership_days 81.5561
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.