How to calculate logistic regression accuracy
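I'm a complete beginner to machine learning and coding in Python, and I have been tasked with coding logistic regression from scratch to understand what happens under the hood. So far I have coded the hypothesis function, the cost function and gradient descent, and then the logistic regression itself. However, when I code the accuracy printout I get a low value (0.69) that does not change when I increase the number of iterations or change the learning rate. My question is: is there a problem with my accuracy code below? Any help pointing me in the right direction would be appreciated.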
X = data[['radius_mean', 'texture_mean', 'perimeter_mean',
          'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
          'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
          'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
          'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
          'fractal_dimension_se', 'radius_worst', 'texture_worst',
          'perimeter_worst', 'area_worst', 'smoothness_worst',
          'compactness_worst', 'concavity_worst', 'concave points_worst',
          'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)
X = data["diagnosis"].map(lambda x: float(x))

def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i]*theta[i]
    return Sigmoid(z)

def Cost_Function(X,Y,theta,m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta,xi)
        error = Y[i] * math.log(hi if hi >0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if hi >0 else 1)
        elif Y[i] == 0:
            error = (1-Y[i]) * math.log(1-hi if 1-hi >0 else 1)
        sumOfErrors += error
    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J )
    return J

def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta,X[i])
        error = (hi - Y[i])*xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X,theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i],theta))
        answer = Y[i]
        if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X,Y,theta,m)
            print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
alpha = 0.0001
iterations = 1000
Logistic_Regression(X,Y,alpha,initial_theta,iterations)
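This uses data from the Wisconsin breast cancer dataset ( https://www.kaggle.com/uciml/breast-cancer-wisconsin-data ), where I am weighing 30 features, although restricting the features to ones known to be correlated does not change my accuracy either.
Python gives us the scikit-learn library, which makes this kind of work easier. This worked for me: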
from sklearn.metrics import accuracy_score
y_pred = log.predict(x_test)
score =accuracy_score(y_test,y_pred)
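The snippet above relies on a fitted classifier named log and on a held-out split, neither of which is shown. A minimal end-to-end sketch follows, assuming log is a scikit-learn LogisticRegression. It uses the copy of the Wisconsin data bundled with scikit-learn (load_breast_cancer) rather than the Kaggle CSV, and the random_state and max_iter values are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Bundled copy of the Wisconsin diagnostic data (569 rows, 30 features).
# Note: here target 0 = malignant and 1 = benign, the reverse of the
# {'M': 1, 'B': 0} mapping used in the question.
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = MinMaxScaler().fit(x_train)

# Assumption: "log" in the snippet above is a LogisticRegression instance.
log = LogisticRegression(max_iter=1000)
log.fit(scaler.transform(x_train), y_train)

y_pred = log.predict(scaler.transform(x_test))
score = accuracy_score(y_test, y_pred)  # fraction of correct predictions
print('LR accuracy:', score)
```

accuracy_score returns a fraction between 0 and 1 by default, so multiply by 100 to compare it with the percentage printed by the hand-written Accuracy function.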
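I'm not sure how you arrived at the value of 0.0001 for alpha, but I think it is too low. Running your code on the cancer data shows that the cost does decrease on every iteration, just extremely slowly.
When I raise alpha to 0.5, the cost still decreases, but at a much more reasonable rate. After 1000 iterations, it reports: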
cost: 0.23668000993020666
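To illustrate the effect of the learning rate, here is a small pure-Python sketch that runs the same batch gradient descent with alpha = 0.0001 and alpha = 0.5. The eight-point toy dataset and the lowercase helper names are made up for this example and are not part of the original code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # Logistic hypothesis for one sample x (bias term included in x).
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, Y):
    # Mean cross-entropy loss over the dataset.
    total = 0.0
    for x, y in zip(X, Y):
        h = predict(theta, x)
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / len(Y)

def gradient_descent(X, Y, alpha, num_iters):
    # Batch gradient descent starting from theta = 0.
    m = len(Y)
    theta = [0.0] * len(X[0])
    for _ in range(num_iters):
        errors = [predict(theta, x) - y for x, y in zip(X, Y)]
        theta = [t - alpha / m * sum(e * x[j] for e, x in zip(errors, X))
                 for j, t in enumerate(theta)]
    return theta

# Hypothetical toy data: a bias column of ones plus one feature in [0, 1].
X = [[1.0, v] for v in (0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)]
Y = [0, 0, 0, 0, 1, 1, 1, 1]

cost_slow = cost(gradient_descent(X, Y, 0.0001, 1000), X, Y)
cost_fast = cost(gradient_descent(X, Y, 0.5, 1000), X, Y)
print('cost with alpha=0.0001:', cost_slow)
print('cost with alpha=0.5:   ', cost_fast)
```

With theta initialised to zero the starting cost is log(2), about 0.693. The small learning rate barely moves away from that value in 1000 iterations, while the larger one makes clear progress.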
After fixing the Accuracy function, I get 92% on the test segment of the data.
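The fix itself is not shown, so the following is only one plausible repair of the loop-based Accuracy function, not necessarily what the answerer did. It assumes the intent was to score the held-out split, calls Hypothesis with its arguments in the declared order (theta first), and compares each prediction with a scalar label. Passing X_test and Y_test as parameters and using an explicit 0.5 threshold are choices made for this sketch, and the toy values at the bottom are invented.

```python
import math

def Sigmoid(z):
    # Numerically stable logistic function, as in the question.
    if z < 0:
        return 1 - 1 / (1 + math.exp(z))
    return 1 / (1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i] * theta[i]
    return Sigmoid(z)

def Accuracy(theta, X_test, Y_test):
    # len() takes a single argument; iterate over the test split only.
    length = len(X_test)
    correct = 0
    for i in range(length):
        # Argument order matches the definition: Hypothesis(theta, x).
        prediction = 1 if Hypothesis(theta, X_test[i]) >= 0.5 else 0
        # Y_test[i] is a scalar label, so no .all() is needed.
        if prediction == Y_test[i]:
            correct += 1
    my_accuracy = (correct / length) * 100
    print('LR Accuracy %: ', my_accuracy)
    return my_accuracy

# Hypothetical toy check: bias column plus one feature.
toy_theta = [-1.0, 2.0]
toy_X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.2], [1.0, 0.9]]
toy_Y = [0, 1, 0, 0]
toy_accuracy = Accuracy(toy_theta, toy_X, toy_Y)
```

For the model to be evaluated fairly, Logistic_Regression would also need to be trained on X_train and Y_train rather than on the full X and Y.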
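You already have NumPy installed, as shown by X = np.array(X). You should really consider using it for your operations; it will be orders of magnitude faster for work like this. Here is a vectorized version that gives results instantly, with no waiting: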
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("cancerdata.csv")
X = df.values[:,2:-1].astype('float64')
X = (X - np.mean(X, axis =0)) / np.std(X, axis = 0)
X = MinMaxScaler().fit_transform(X)
## Add a bias column to the data
## (done after scaling: MinMaxScaler would map a constant column of ones to 0)
X = np.hstack([np.ones((X.shape[0], 1)),X])
Y = df["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)

def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, x):
    return Sigmoid(x @ theta)

def Cost_Function(X,Y,theta,m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X,Y,theta,m,alpha):
    hi = Hypothesis(theta,X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)
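I think I may have a different version of scikit-learn, because I had to change the MinMaxScaler line to get it to work. The result is that I can run 10K iterations in the blink of an eye, and applying the model to the test set gives an accuracy of about 97%.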
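Accuracy is one of the most intuitive performance metrics: it is simply the ratio of correctly predicted observations to total observations. Higher accuracy means the model performs better.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
TP = True positives
TN = True negatives
FN = False negatives
FP = False positives
Accuracy is an appropriate metric when false positives and false negatives have similar costs. A better metric is the F1 score, given by
F1-score = 2 * (Recall * Precision) / (Recall + Precision), where
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)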
Read more here:
https://zh.wikipedia.org/wiki/Precision_and_recall
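The formulas above can be sketched in a few lines of plain Python. The helper names and the example labels are made up for illustration; label 1 is treated as the positive class.

```python
def confusion_counts(y_true, y_pred):
    # Count the four outcomes for binary labels (1 = positive, 0 = negative).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Note: precision and recall are undefined when their denominator is 0;
    # this sketch assumes at least one predicted and one actual positive.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return accuracy, precision, recall, f1

# Hypothetical labels: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall, f1 = metrics(y_true, y_pred)
print('accuracy:', accuracy)
print('precision:', precision, 'recall:', recall, 'F1:', f1)
```

In this example TP = 2, TN = 4, FP = 1 and FN = 1, so accuracy is 6/8 while precision, recall and F1 are all 2/3.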
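The nice thing about doing machine learning in Python is that important modules such as scikit-learn are open source, so you can always look at the actual code. The link below leads to the scikit-learn metrics source code, which shows how scikit-learn computes the accuracy score when you call: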
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/metrics