Crosstab and confusion_matrix results disagreement in Python

I need to produce a confusion matrix using the crosstab function in Python (as an exercise). I have been doing this with various data sets and it worked fine but this time I'm having an odd problem.

The data set is divided into training and test sets (X_train, y_train, X_test, y_test). The test set is a Series of 0s and 1s constituting the response variable. I ran logistic regression on the training set, and predicted the value of the test set:

logit1 = sm.Logit(y_train, X_train).fit()
pred = logit1.predict(X_test)

Then I apply a cutoff of 0.5 to classify the response, which gives a Series of 0s and 1s with the same length as y_test (2500). This Series is called res, and I now want to build the confusion table from it with crosstab:

cross_table = pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)

But this gives me the following table which doesn't add up to 2500:

Predicted  0.0  1.0  All
Actual                  
0.0        413   52  465
1.0        140   20  160
All        553   72  625

While when I use the confusion_matrix function from sklearn, I get the correct total of 2500:

confusion_matrix(y_test, res)

array([[1817,  110],
       [ 369,  205]])

What is wrong with my crosstab call?

Packages:

from pandas import Series, DataFrame
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix

Full code:

# indexes of train and test were provided in external files:
train = pd.read_csv('/Users//train.csv')
test = pd.read_csv('/Users//test.csv')

X_train = X.iloc[train.values[:,0],:]
X_test = X.iloc[test.values[:,0],:]

y_train = y[train.values[:,0]]
y_test = y[test.values[:,0]]

logit1 = sm.Logit(y_train, X_train).fit()
pred = logit1.predict(X_test)

res = []
for i in pred:
    if i >= 0.5:
        each = 1
    else:
        each = 0
    res.append(each)

res = Series(res)

cross_table = pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)

d = confusion_matrix(y_test, res)
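
The thresholding loop above can also be written as one vectorized comparison. This is a minimal sketch, assuming pred is a pandas Series (the probabilities and index labels below are made up for illustration). Unlike building a list and wrapping it in Series, the vectorized form keeps the index of pred, which matters when the result is later compared with another Series.

```python
import pandas as pd

# Hypothetical predicted probabilities, indexed like a test subset of a larger data set
pred = pd.Series([0.2, 0.7, 0.5, 0.49], index=[10, 3, 25, 7])

# Vectorized cutoff at 0.5; the result keeps pred's index labels
res = (pred >= 0.5).astype(int)

print(res)
```

If pred is a NumPy array instead, the same comparison works, but the result is an array with no index.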

Suggested edit:

cross_table = pd.crosstab(y_test, res, rownames=['Actual'],
                          colnames=['Predicted'], margins=True, dropna=False)

Predicted   0.0  1.0   All
Actual                    
0.0         413   52  1927
1.0         140   20   574
All        2186  315  4377

While I still don't know why the above didn't work, I figured out what needs to be changed to make it work. The object res, containing the predictions, needs to be saved as an array:

import numpy as np

res = np.array(res)
cross_table = pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted     0    1   All
Actual                    
0          1817  110  1927
1           369  205   574
All        2186  315  2501

This matches the result from confusion_matrix.
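
A likely explanation for why the array works: pd.crosstab matches Series by index label, not by position. In the question, y_test appears to keep the row labels of the original data, while res was rebuilt from a list and so has a fresh index starting at 0. Only rows whose labels occur in both are counted. The sketch below uses small made-up data to show the effect; the values and index labels are assumptions, not the poster's data.

```python
import pandas as pd

# Hypothetical labels whose index is a scattered subset of rows, like y_test
y_test = pd.Series([0, 1, 0, 1, 1], index=[2, 5, 0, 9, 7])
# Predictions rebuilt from a list get a fresh 0..4 index, like res
res = pd.Series([0, 1, 1, 1, 0])

# Series vs Series: rows are paired by index label, so only labels 0 and 2 match
misaligned = pd.crosstab(y_test, res)

# Series vs array: the array has no index, so rows are paired by position
by_position = pd.crosstab(y_test, res.values)

# Alternative: keep a Series but give it y_test's index
res_aligned = pd.Series(res.values, index=y_test.index)
realigned = pd.crosstab(y_test, res_aligned)

print(misaligned.values.sum(), by_position.values.sum(), realigned.values.sum())
```

Either passing an array or reusing the index of y_test makes crosstab count every row, which is what confusion_matrix does, since it pairs the two inputs by position.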

If I do:

import numpy as np
import pandas as pd
data = np.array([1, 1, 0, 0, 0])
data2 = np.array([1, 0, 0, 0, 1])
y_test =  pd.Series(data) 
res = pd.Series(data2)

and run: pd.crosstab(y_test, res, rownames=['Actual'], colnames=['Predicted'], margins=True)

I get:

[Image: crosstab output]

which is correct.

And also:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, res)

[Image: confusion_matrix output]

This also gives the correct output, so the error is somewhere else.
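
One reason this toy example may not reproduce the problem: both Series here are created with the default 0..4 index, so they line up by label as well as by position. A quick diagnostic, sketched under the assumption that the real y_test carries non-default row labels (the labels below are hypothetical), is to compare the two indexes directly.

```python
import numpy as np
import pandas as pd

y_test = pd.Series(np.array([1, 1, 0, 0, 0]))
res = pd.Series(np.array([1, 0, 0, 0, 1]))

# Both Series carry the default 0..4 index, so label alignment changes nothing here
same_index = y_test.index.equals(res.index)

# Simulate a test set drawn from scattered rows of a larger data set
y_scattered = y_test.copy()
y_scattered.index = [40, 2, 17, 4, 33]

# Labels present in both indexes; only these rows can be paired by label
shared_labels = y_scattered.index.intersection(res.index)

print(same_index, len(shared_labels), len(y_scattered))
```

If the number of shared labels is smaller than the length of the test set, a crosstab of the two Series will total fewer rows than expected.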
