How to calculate logistic regression accuracy

Question

I am a complete beginner in machine learning and in Python coding, and I have been tasked with coding logistic regression from scratch to understand what happens under the hood. So far I have coded the hypothesis function, the cost function and gradient descent, and then the logistic regression itself. However, when I print the accuracy I get a low value (0.69) that doesn't change when I increase the number of iterations or change the learning rate. My question is: is there a problem with my accuracy code below? Any help pointing me in the right direction would be appreciated.

X = data[['radius_mean', 'texture_mean', 'perimeter_mean',
   'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
   'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
   'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
   'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
   'fractal_dimension_se', 'radius_worst', 'texture_worst',
   'perimeter_worst', 'area_worst', 'smoothness_worst',
   'compactness_worst', 'concavity_worst', 'concave points_worst',
   'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)

X = data["diagnosis"].map(lambda x: float(x))

def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i]*theta[i]
    return Sigmoid(z)

def Cost_Function(X,Y,theta,m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta,xi)
        error = Y[i] * math.log(hi if  hi >0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if  hi >0 else 1)
        elif Y[i] == 0:
            error = (1-Y[i]) * math.log(1-hi  if  1-hi >0 else 1)
        sumOfErrors += error

    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J ) 
    return J

def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta,X[i])
        error = (hi - Y[i])*xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta


def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X,theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i],theta))
        answer = Y[i]
    if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length)*100
    print ('LR Accuracy %: ', my_accuracy)



def Logistic_Regression(X,Y,alpha,theta,num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X,Y,theta,m)
            print ('theta: ', theta)    
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]  
alpha = 0.0001
iterations = 1000
Logistic_Regression(X,Y,alpha,initial_theta,iterations)

This uses data from the Wisconsin breast cancer dataset ( https://www.kaggle.com/uciml/breast-cancer-wisconsin-data ), where I am weighing in 30 features, although changing the features to ones which are known to correlate also doesn't change my accuracy.

Answer 1

Python gives us the scikit-learn library, which makes this work easier. This worked for me (log is an already fitted model):

from sklearn.metrics import accuracy_score

y_pred = log.predict(x_test)

score =accuracy_score(y_test,y_pred)
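The snippet above leaves out how log, x_test and y_test were created. A minimal end-to-end sketch follows. It assumes that scikit-learn's bundled copy of the Wisconsin data (load_breast_cancer) is an acceptable stand-in for the Kaggle CSV, and the choices random_state=0 and max_iter=1000 are illustrative rather than taken from the answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled dataset replaces the Kaggle CSV.
# Its labels are 0 = malignant, 1 = benign (the reverse of the question's mapping).
x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = MinMaxScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# "log" is the fitted model that the answer's snippet calls predict() on
log = LogisticRegression(max_iter=1000)
log.fit(x_train, y_train)

y_pred = log.predict(x_test)
score = accuracy_score(y_test, y_pred)  # fraction of correct predictions
print('accuracy:', score)
```

accuracy_score returns a fraction between 0 and 1, so multiply by 100 to compare it with the percentage printed by the from-scratch code.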

Answer 2

I'm not sure how you arrived at a value of 0.0001 for alpha, but I think it's too low. Running your code on the cancer data shows that the cost is decreasing with each iteration; it's just going glacially.

When I raise alpha to 0.5, the cost still decreases, but at a more reasonable rate. After 1000 iterations it reports:

cost:  0.23668000993020666
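The effect of the learning rate can be reproduced on a small synthetic problem. The sketch below is not the asker's code or data: the two-feature dataset, the seed and the helper names (sigmoid, cost, train) are assumptions made for illustration. It runs the same number of gradient descent iterations with alpha = 0.0001 and alpha = 0.5 and prints the final cost for each.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, theta):
    # Mean cross-entropy; eps keeps log() away from zero
    h = sigmoid(X @ theta)
    eps = 1e-12
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def train(X, y, alpha, num_iters):
    # Plain batch gradient descent starting from theta = 0
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta

# Hypothetical data: two features in [0, 1] (like min-max scaled inputs),
# label 1 when their sum exceeds 1
rng = np.random.default_rng(0)
m = 200
features = rng.uniform(0.0, 1.0, size=(m, 2))
y = (features[:, 0] + features[:, 1] > 1.0).astype(float)
X = np.hstack([np.ones((m, 1)), features])  # bias column first

cost_slow = cost(X, y, train(X, y, alpha=0.0001, num_iters=1000))
cost_fast = cost(X, y, train(X, y, alpha=0.5, num_iters=1000))
print('alpha=0.0001 cost:', cost_slow)
print('alpha=0.5    cost:', cost_fast)
```

With the small step size the cost barely moves from its starting value of ln 2 (about 0.693), which is the same "decreasing, but glacially" behaviour described above.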

And after fixing the Accuracy function I'm getting 92% on the test segment of the data.
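The answer does not show its fix for the loop-based Accuracy function, so the following is only one possible repair, kept in the question's style. Compared with the original it passes theta first to Hypothesis, takes len() of the test rows only, moves the comparison inside the loop, and reads from the test split instead of the global X and Y. Passing X_test and Y_test as parameters and using an explicit 0.5 threshold instead of round() are choices made for this sketch; the toy theta and data at the bottom are hypothetical.

```python
import math

def Sigmoid(z):
    # Same overflow-safe form as in the question
    if z < 0:
        return 1 - 1 / (1 + math.exp(z))
    return 1 / (1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i] * theta[i]
    return Sigmoid(z)

def Accuracy(theta, X_test, Y_test):
    correct = 0
    length = len(X_test)  # number of test rows
    for i in range(length):
        # theta comes first, matching the Hypothesis signature
        prediction = 1 if Hypothesis(theta, X_test[i]) >= 0.5 else 0
        # The comparison has to happen for every row, inside the loop
        if prediction == Y_test[i]:
            correct += 1
    my_accuracy = (correct / length) * 100
    print('LR Accuracy %: ', my_accuracy)
    return my_accuracy

# Hypothetical toy check: first column is the bias term
theta = [-1.0, 2.0]
X_test = [[1.0, 0.1], [1.0, 0.9], [1.0, 0.4], [1.0, 0.7]]
Y_test = [0, 1, 1, 1]
acc = Accuracy(theta, X_test, Y_test)  # 3 of 4 rows are classified correctly
```

In the question's script, Logistic_Regression would also need to be trained on X_train and Y_train, and the line that overwrites X with the diagnosis column removed, for the reported number to be a genuine test accuracy.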

You have NumPy installed, as shown by X = np.array(X). You should really consider using it for your operations. It will be orders of magnitude faster for jobs like this. Here is a vectorized version that gives results instantly rather than waiting:

import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

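df = pd.read_csv("cancerdata.csv")
# Skip the first two columns (id, diagnosis) and the trailing column
X = df.values[:,2:-1].astype('float64')
X = (X - np.mean(X, axis =0)) /  np.std(X, axis = 0)
# Scale the features before adding the bias column: MinMaxScaler maps a
# constant column of ones to zeros, which would remove the intercept
X = MinMaxScaler().fit_transform(X)

## Add a bias column to the data
X = np.hstack([np.ones((X.shape[0], 1)),X])
Y = df["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)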


def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, x):   
    return Sigmoid(x @ theta) 

def Cost_Function(X,Y,theta,m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J

def Cost_Function_Derivative(X,Y,theta,m,alpha):
    hi = Hypothesis(theta,X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J

def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print ('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X,Y,alpha,theta,num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)    
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)

ep = .012

initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)

I think I might have a different version of scikit-learn, because I had to change the MinMaxScaler line to make it work. The result is that I can run 10K iterations in the blink of an eye, and applying the model to the test set gives about 97% accuracy.
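One difference from the question's code is worth noting: the vectorized Sigmoid above drops the z < 0 branch, so np.exp(-z) can overflow (with a runtime warning) for large negative z. The sketch below is an optional variant, not part of the answer; the name stable_sigmoid is made up here. It keeps the exponent non-positive for every element, which is the array equivalent of the two-branch scalar version in the question.

```python
import numpy as np

def stable_sigmoid(z):
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it cannot overflow
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

values = stable_sigmoid(np.array([-1000.0, -3.0, 0.0, 2.0, 1000.0]))
print(values)
```

For the min-max scaled features used here the plain version is usually fine; the variant only matters once x @ theta grows large in magnitude.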

Answer 3

Accuracy is one of the most intuitive performance measures: it is simply the ratio of correctly predicted observations to the total number of observations. Higher accuracy means the model is performing better.

Accuracy = (TP+TN)/(TP+FP+FN+TN)

TP = True positives
TN = True negatives
FP = False positives
FN = False negatives

Accuracy is only a reliable measure when your false positives and false negatives have similar costs. Otherwise, a better metric is the F1-score, which is given by

F1-score = 2*(Recall*Precision)/(Recall+Precision) where,

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
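As a worked example of the formulas above, the sketch below counts TP, TN, FP and FN from two label lists and derives each metric. The function name classification_metrics and the ten sample labels are made up for illustration; the labels are deliberately imbalanced so that accuracy and F1-score disagree.

```python
def classification_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells for binary labels (1 = positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if (recall + precision) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Hypothetical labels: 2 positives out of 10, one of them missed
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here TP = 1, TN = 8, FP = 0 and FN = 1, so accuracy is 0.9 while recall is only 0.5 and the F1-score is about 0.67: the missed positive barely dents accuracy but shows up clearly in F1.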

Read more here:

https://en.wikipedia.org/wiki/Precision_and_recall

The beauty of machine learning in Python is that important modules like scikit-learn are open source, so you can always look at the actual code. The link below points to the scikit-learn metrics source code, which will give you an idea of how scikit-learn calculates the accuracy score when you do

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)

https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/metrics

3 answers

Answer 1
2 2019-07-04 18:16:50

Answer 2
1 Accepted 2017-11-23 00:45:13

Answer 3
1 2017-11-23 05:42:11
