Cross-validation with KFold

I'm trying to use three binary explanatory variables relating to banking history (default, housing, and loan) to predict a binary response variable using a Logistic Regression classifier.

I have the following dataset:

(image of the dataset not included)

A mapping dictionary converts the text values no/yes to the integers 0/1:

convert_to_binary = {'no' : 0, 'yes' : 1}
default = bank['default'].map(convert_to_binary)
housing = bank['housing'].map(convert_to_binary)
loan = bank['loan'].map(convert_to_binary)
response = bank['response'].map(convert_to_binary)

I added my three explanatory variables and the response to an array:

data = np.array([np.array(default), np.array(housing), np.array(loan),np.array(response)]).T

kfold = KFold(n_splits=3)

scores = []
for train_index, test_index in kfold.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = response[train_index], response[test_index]
    model = LogisticRegression().fit(X_train, y_train)
    pred = model.predict(data[test_index])
    results = model.score(X_test, y_test)
    scores.append(results)
print(np.mean(scores))

My accuracy is always 100%, which I know is not correct. I would expect the accuracy to be somewhere around 50-65%.

Is there something I'm doing wrong?

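The split is not correct: data includes the response as its fourth column, so the features passed to the model contain the label itself, and the model simply reads it back. That is why the accuracy is always 100%.

Here is the correct split, with the response column kept out of the features. It goes inside the for loop over kfold.split(data), and it needs import sklearn.metrics at the top of the script:

# features only: default, housing, loan (columns 0-2)
X_train, X_test = data[train_index, :3], data[test_index, :3]
# labels only: response (column 3)
y_train, y_test = data[train_index, 3], data[test_index, 3]
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
acc = sklearn.metrics.accuracy_score(y_test, pred, normalize=True)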
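
To see what happens when the response column sits among the features, here is a minimal self-contained sketch. It does not use the asker's bank data: the random 0/1 arrays, the sample size of 600, and the helper name cv_accuracy are all assumptions made up for illustration. The response is generated independently of the features, so an honest model should land near chance level.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def cv_accuracy(X, y, n_splits=3):
    """Mean LogisticRegression accuracy over KFold splits (hypothetical helper)."""
    scores = []
    for train_index, test_index in KFold(n_splits=n_splits).split(X):
        model = LogisticRegression().fit(X[train_index], y[train_index])
        scores.append(model.score(X[test_index], y[test_index]))
    return float(np.mean(scores))


# Synthetic stand-in for the bank data (assumption): three random 0/1
# features and a random 0/1 response that is unrelated to them.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(600, 3))
response = rng.integers(0, 2, size=600)

# Same layout as `data` in the question: the response is the fourth column.
data = np.column_stack([features, response])

leaky = cv_accuracy(data, response)         # label is among the inputs
clean = cv_accuracy(data[:, :3], response)  # features only

print(f"with response column:    {leaky:.2f}")
print(f"without response column: {clean:.2f}")
```

With the response column included, the score should come out at or very close to 1.0 even though the features carry no information. Once it is dropped, the score should fall back to roughly chance level for this synthetic data. On the real bank data the second number would depend on how predictive default, housing, and loan actually are.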
