Python Logistic Regression

Question

I have been at this for a couple of hours and feel really really stuck now.

I am trying to use several columns in a CSV file, "ScoreBuckets.csv", to predict another column in that file called "Score_Bucket". The problem is that my results don't make any sense at all, and I don't know how to use multiple columns as predictors. I am new to data mining, so I am not 100% familiar with the code/syntax.

Here is the code I have so far:

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score

dataset = pd.read_csv('ScoreBuckets.csv')

CV =  (dataset.Score_Bucket.reshape((len(dataset.Score_Bucket), 1))).ravel()
data = (dataset.ix[:,'CourseLoad_RelativeStudy':'Sleep_Sex'].values).reshape(
           (len(dataset.Score_Bucket), 2))


# Create a logistic regression object
LogReg = LogisticRegression()

# Train the model using the training sets
LogReg.fit(data, CV)

# the model
print('Coefficients (m): \n', LogReg.coef_)
print('Intercept (b): \n', LogReg.intercept_)

#predict the class for each data point
predicted = LogReg.predict(data)
print("Predictions: \n", np.array([predicted]).T)

# predict the probability/likelihood of the prediction
print("Probability of prediction: \n",LogReg.predict_proba(data))
modelAccuracy = LogReg.score(data,CV)
print("Accuracy score for the model: \n", LogReg.score(data,CV))

print(metrics.confusion_matrix(CV, predicted, labels=["Yes","No"]))

# Calculating 5 fold cross validation results
LogReg = LogisticRegression()
kf = KFold(len(CV), n_folds=5)
scores = cross_val_score(LogReg, data, CV, cv=kf)
print("Accuracy of every fold in 5 fold cross validation: ", abs(scores))
print("Mean of the 5 fold cross-validation: %0.2f" % abs(scores.mean()))

print("The accuracy difference between model and KFold is: ",
      abs(abs(scores.mean())-modelAccuracy))

ScoreBuckets.csv:

Score_Bucket,Healthy,Course_Load,Miss_Class,Relative_Study,Faculty,Sleep,Relation_Status,Sex,Relative_Stress,Res_Gym?,Tuition_Awareness,Satisfaction,Healthy_TuitionAwareness,Healthy_TuitionAwareness_MissClass,Healthy_MissClass_Sex,Sleep_Faculty_RelativeStress,TuitionAwareness_ResGym,CourseLoad_RelativeStudy,Sleep_Sex
5,0.5,1,0,1,0.4,0.33,1,0,0.5,1,0,0,0.75,0.5,0.17,0.41,0.5,1,0.17
2,1,1,0.33,0.5,0.4,0.33,0,0,1,0,0,0,0.5,0.44,0.44,0.58,0,0.75,0.17
5,0.5,1,0,0.5,0.4,0.33,1,0,0.5,0,1,0,0.75,0.5,0.17,0.41,0.5,0.75,0.17
4,0.5,1,0,0,0.4,0.33,0,0,0.5,0,1,0,0.25,0.17,0.17,0.41,0.5,0.5,0.17
5,0.5,1,0.33,0.5,0.4,0,1,1,1,0,1,0,0.75,0.61,0.61,0.47,0.5,0.75,0.5
5,0.5,1,0,1,0.4,0.33,1,1,1,1,1,1,0.75,0.5,0.5,0.58,1,1,0.67
5,0.5,1,0,0,0.4,0.33,0,0,0.5,0,1,0,0.25,0.17,0.17,0.41,0.5,0.5,0.17
2,0.5,1,0.67,0.5,0.4,0,1,1,0.5,0,0,0,0.75,0.72,0.72,0.3,0,0.75,0.5
5,0.5,1,0,1,0.4,0.33,0,1,1,0,1,1,0.25,0.17,0.5,0.58,0.5,1,0.67
5,1,1,0,0.5,0.4,0.33,0,1,0.5,0,1,1,0.5,0.33,0.67,0.41,0.5,0.75,0.67
0,0.5,1,0,1,0.4,0.33,0,0,0.5,0,0,0,0.25,0.17,0.17,0.41,0,1,0.17
2,0.5,1,0,0.5,0.4,0.33,1,1,1,0,0,0,0.75,0.5,0.5,0.58,0,0.75,0.67
5,0.5,1,0,1,0.4,0.33,0,0,1,1,1,0,0.25,0.17,0.17,0.58,1,1,0.17
0,0.5,1,0.33,0.5,0.4,0.33,1,1,0.5,0,1,0,0.75,0.61,0.61,0.41,0.5,0.75,0.67
5,0.5,1,0,0.5,0.4,0.33,0,0,0.5,0,1,1,0.25,0.17,0.17,0.41,0.5,0.75,0.17
4,0,1,0.67,0.5,0.4,0.67,1,0,0.5,1,0,0,0.5,0.56,0.22,0.52,0.5,0.75,0.34
2,0.5,1,0.33,1,0.4,0.33,0,0,0.5,0,1,0,0.25,0.28,0.28,0.41,0.5,1,0.17
5,0.5,1,0.33,0.5,0.4,0.33,0,1,1,0,1,0,0.25,0.28,0.61,0.58,0.5,0.75,0.67
5,0.5,1,0,1,0.4,0.33,0,0,0.5,1,1,0,0.25,0.17,0.17,0.41,1,1,0.17
5,0.5,1,0.33,0.5,0.4,0.33,1,1,1,0,1,0,0.75,0.61,0.61,0.58,0.5,0.75,0.67

Output:

Coefficients (m): 
 [[-0.4012899  -0.51699939]
 [-0.72785212 -0.55622303]
 [-0.62116232  0.30564259]
 [ 0.04222459 -0.01672418]]
Intercept (b): 
 [-1.80383738 -1.5156701  -1.29452772  0.67672118]
Predictions: 
 [[5]
 [5]
 [5]
 [5]
 ...
 [5]
 [5]
 [5]
 [5]]
Probability of prediction: 
 [[ 0.09302973  0.08929139  0.13621146  0.68146742]
 [ 0.09777325  0.10103782  0.14934111  0.65184782]
 [ 0.09777325  0.10103782  0.14934111  0.65184782]
 [ 0.10232068  0.11359509  0.16267645  0.62140778]
 ...
 [ 0.07920945  0.08045552  0.17396476  0.66637027]
 [ 0.07920945  0.08045552  0.17396476  0.66637027]
 [ 0.07920945  0.08045552  0.17396476  0.66637027]
 [ 0.07346886  0.07417316  0.18264008  0.66971789]]
Accuracy score for the model: 
 0.671171171171
[[0 0]
 [0 0]]
Accuracy of every fold in 5 fold cross validation:  
    [ 0.64444444  0.73333333  0.68181818  0.63636364  0.65909091]
Mean of the 5 fold cross-validation: 0.67
The accuracy difference between model and KFold is:  0.00016107016107

The output doesn't make sense to me for two reasons: 1. Regardless of which columns I feed in, the prediction accuracy stays the same, which shouldn't happen because some columns are better predictors of the Score_Bucket column than others. 2. It won't let me use multiple columns to predict Score_Bucket because it says they have to be the same size, but how can that be when multiple columns obviously have a larger array size than the single Score_Bucket column?

What am I doing wrong with the prediction?

Answer 1

First of all, double-check if your problem can really be framed as a classification problem or if it should rather be formulated as a regression problem.
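A quick look at the target can help with that decision. The sketch below uses only the 20 Score_Bucket values posted in the question (the full file is not shown, so the numbers for the real data may differ) and counts how often each bucket occurs. The variable names are illustrative.

```python
import pandas as pd

# Score_Bucket values copied from the 20 rows posted in the question
score_bucket = pd.Series(
    [5, 2, 5, 4, 5, 5, 5, 2, 5, 5, 0, 2, 5, 0, 5, 4, 2, 5, 5, 5],
    name="Score_Bucket",
)

# How many distinct buckets are there, and how often does each occur?
counts = score_bucket.value_counts()
print(counts)

# Share of the most frequent bucket: a model that always predicts this
# bucket would already reach this accuracy on these rows
majority_share = counts.max() / len(score_bucket)
print("Majority-class share:", majority_share)
```

On these rows, bucket 5 accounts for 12 of the 20 samples, so a model that always predicts 5 is already right 60% of the time. That kind of imbalance is one plausible reason the accuracy barely moves when the features change. If the buckets are ordered (an assumption; the question does not say), that would also be an argument for considering a regression formulation.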

Assuming you really want to classify your data into the four unique classes present in the Score_Bucket column, why do you think you cannot use multiple columns as predictors? In fact, you are using the last two columns in your example. You can make your code a bit more readable if you consider that sklearn methods directly work with Pandas DataFrames (no need for converting to NumPy arrays):

X = dataset[["CourseLoad_RelativeStudy", "Sleep_Sex"]]
# Single brackets give a 1-D Series; a one-column DataFrame triggers a shape warning
y = dataset["Score_Bucket"]
logreg = LogisticRegression()
logreg.fit(X, y)
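For a self-contained illustration of the same pattern, here is a minimal sketch on synthetic data (the feature values are random and made up; only the column names and the bucket values 0, 2, 4, 5 mirror the question). It passes a DataFrame and a 1-D Series straight to `fit` and shows that, with four classes, `coef_` has one row per class and one column per feature, which matches the 4x2 coefficient array in the question's output. It also builds the confusion matrix with the actual class labels; the `labels=["Yes","No"]` used in the question do not occur in Score_Bucket, which likely explains the all-zero 2x2 matrix there.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n_samples = 40

# Synthetic stand-in for ScoreBuckets.csv (values are random, not real data)
dataset = pd.DataFrame({
    "Score_Bucket": np.tile([0, 2, 4, 5], n_samples // 4),
    "CourseLoad_RelativeStudy": rng.random(n_samples),
    "Sleep_Sex": rng.random(n_samples),
})

X = dataset[["CourseLoad_RelativeStudy", "Sleep_Sex"]]  # 2-D: (n_samples, 2)
y = dataset["Score_Bucket"]                             # 1-D: (n_samples,)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)
predicted = logreg.predict(X)

print("Classes:", logreg.classes_)          # the distinct bucket values
print("coef_ shape:", logreg.coef_.shape)   # (n_classes, n_features)

# Use the labels that actually occur in the target
cm = confusion_matrix(y, predicted, labels=logreg.classes_)
print(cm)
```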

If you want to select more columns, you can use the loc method:

X = dataset.loc[:, "Healthy":"Sleep_Sex"]

You could also select columns by index:

X = dataset.iloc[:, 1:]
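To see what these selections return, the sketch below loads the header and the first three data rows posted in the question from a string (standing in for the CSV file) and prints the resulting shapes. It also shows a likely cause of the size error described in the question: the original code reshapes the selected values to a hard-coded two columns, which only works when exactly two columns are selected. No reshape is needed, because the selection is already two-dimensional.

```python
import io
import pandas as pd

# Header and first three rows as posted in the question
csv_text = """Score_Bucket,Healthy,Course_Load,Miss_Class,Relative_Study,Faculty,Sleep,Relation_Status,Sex,Relative_Stress,Res_Gym?,Tuition_Awareness,Satisfaction,Healthy_TuitionAwareness,Healthy_TuitionAwareness_MissClass,Healthy_MissClass_Sex,Sleep_Faculty_RelativeStress,TuitionAwareness_ResGym,CourseLoad_RelativeStudy,Sleep_Sex
5,0.5,1,0,1,0.4,0.33,1,0,0.5,1,0,0,0.75,0.5,0.17,0.41,0.5,1,0.17
2,1,1,0.33,0.5,0.4,0.33,0,0,1,0,0,0,0.5,0.44,0.44,0.58,0,0.75,0.17
5,0.5,1,0,0.5,0.4,0.33,1,0,0.5,0,1,0,0.75,0.5,0.17,0.41,0.5,0.75,0.17
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X_two = dataset[["CourseLoad_RelativeStudy", "Sleep_Sex"]]
X_loc = dataset.loc[:, "Healthy":"Sleep_Sex"]  # label slice, end label included
X_iloc = dataset.iloc[:, 1:]                   # every column except the first
y = dataset["Score_Bucket"]

print(X_two.shape, X_loc.shape, X_iloc.shape, y.shape)

# Forcing 19 columns into a hard-coded 2-column shape fails
reshape_failed = False
try:
    X_loc.values.reshape((len(y), 2))
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```

The features and the target only need the same number of rows; the number of feature columns is free.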

Regarding the accuracy that never changes, I do get different results from the cross-validation procedure depending on which columns I use as features. Just note that the posted data contains a very low number of samples (20), which makes the estimated accuracy rather variable.
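To compare feature sets yourself, a sketch along these lines may help. It uses synthetic data (one column constructed to be informative and one pure-noise column, both hypothetical) rather than the question's file, and it imports from `sklearn.model_selection`, which replaced the `sklearn.cross_validation` module used in the question in later scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic three-class target and two hypothetical feature columns
y = pd.Series(rng.integers(0, 3, n_samples), name="target")
data = pd.DataFrame({
    "informative": y + rng.normal(0.0, 0.3, n_samples),  # tracks the target
    "noise": rng.normal(0.0, 1.0, n_samples),            # unrelated to it
})

# Newer API: n_splits instead of passing the sample count and n_folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

def mean_cv_accuracy(columns):
    """Mean 5-fold accuracy of logistic regression on the given columns."""
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, data[columns], y, cv=kf)
    return scores.mean()

acc_informative = mean_cv_accuracy(["informative"])
acc_noise = mean_cv_accuracy(["noise"])
acc_both = mean_cv_accuracy(["informative", "noise"])

print("informative only:", acc_informative)
print("noise only:      ", acc_noise)
print("both columns:    ", acc_both)
```

With only a handful of rows per fold, as in the 20-row sample, individual fold scores can swing widely, so it is safer to compare the mean across folds than any single fold.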

solution1
1 ACCEPTED 2016-11-29 14:32:25
