

Questions on Logistic Regression

I'm now using the training set from OpenClassroom (http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html) to give logistic regression a try. Unlike that page, which fits LR with Newton's method, I only use plain gradient descent. Below is my code:

from numpy import *
import matplotlib.pyplot as plt

def loadDataSet():
    dataMat = []; labelMat = []
    frX = open('../ex4x.dat')
    frY = open('../ex4y.dat')
    for line1 in frX.readlines():
        lineArr1 = line1.strip().split()
        dataMat.append([1.0, float(lineArr1[0]), float(lineArr1[1])])

    for line2 in frY.readlines():
        lineArr2 = line2.strip().split()
        labelMat.append(float(lineArr2[0]))
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

# def autoNorm(dataSet):
# #   newValue = (oldValue-min)/(max-min)
#     minVals = min(dataSet)
#     maxVals = max(dataSet)
#     ranges = list(map(lambda x: x[0]-x[1], zip(maxVals, minVals)))
#     normDataSet = zeros(shape(dataSet))
#     m,n = shape(dataSet)
#     normDataSet = list(map(lambda x: x[0]-x[1], zip(dataSet,tile(minVals, (m,1)))))
#     normDataSet = normDataSet/tile(ranges, (m,1))
#     return normDataSet, ranges, minVals

def gradDescent(dataMatIn, classLabels):
    x = mat(dataMatIn)
    y = mat(classLabels).transpose()
    m,n = shape(x)
    alpha = 0.001                 # learning rate
    maxCycles = 100000
    theta = ones((n,1))           # parameters, initialized to all ones
    for k in range(maxCycles):
        h = sigmoid(x*theta)      # hypothesis for all m examples
        error = h - y
        cost = -1*dot(log(h).T,y)-dot((1-y).T,log(1-h))   # cross-entropy cost
        print("Iteration %d | Cost: %f" % (k, cost))
        theta = theta - alpha * (x.transpose() * error /m)  # batch gradient step
    return theta

def plotBestFit(weights):
    dataMat,labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i])== 1:
            xcord1.append(dataArr[i,1]);ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]);ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    min_x = min(mat(dataMat)[:, 1])
    max_x = max(mat(dataMat)[:, 1])
    x = arange(min_x, max_x, 1)
    y = (-weights[0]-weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2');
    plt.show()

dataMat, classLabel = loadDataSet()
weights = gradDescent(dataMat, classLabel)
print(weights)
plotBestFit(weights.getA())

Here are my questions:

1. I trained it for 100,000 iterations, with the cost printed at each iteration, but I still didn't see it converge; well, actually I'm not sure how to tell.

2. I'm not sure how to plot the classifier correctly with matplotlib. With maxCycles at 200,000 I can get a somewhat reasonable classifier, but with maxCycles at 100,000 the plot doesn't seem reasonable at all.

[Plot: maxCycle is 100,000]

UPDATE CODE:

count = 0
m = len(dataMat)                           # 80 examples in this data set
for i in range(m):
    result = sigmoid(dataMat[i] * weights)
    predicted = 1 if result > 0.5 else 0
    if float(predicted) != classLabel[i]:  # classLabel stores plain floats (see loadDataSet)
        count += 1
errorRate = float(count) / m
print("error count is: %d, error rate is: %f" % (count, errorRate))

Your code is actually fine! Here are some remarks:

  1. You initialized the thetas with all ones. I would not do so in this example. The first call of the sigmoid function will return values close to 1, because the product of theta and x gives very large numbers. The computation of log(1 - h) can then fail, because log is not defined at 0. I prefer to initialize the thetas with 0's (see the short sketch after this list).

  2. When calculating the cost function you missed the division by m. It does not matter for the algorithm, but it's better to follow the theory: J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)).

  3. It's a good idea to plot the cost function, not just print its values. The correct trend can then be seen very clearly.

  4. In order to converge, this particular example needs many more iterations. I reached a good result at 500,000 iterations.
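
To illustrate the first remark: with an all-ones theta and raw exam scores, the sigmoid saturates and the log term blows up. A minimal sketch (the scores 55 and 70 are made-up values of the same magnitude as this data set):

import numpy as np

x = np.array([1.0, 55.0, 70.0])                   # intercept + two raw exam scores
theta_ones = np.ones(3)
h = 1.0 / (1 + np.exp(-np.dot(x, theta_ones)))    # sigmoid(126.0) rounds to exactly 1.0 in float64
print(np.log(1 - h))                              # -inf, with a divide-by-zero warning

theta_zeros = np.zeros(3)
h = 1.0 / (1 + np.exp(-np.dot(x, theta_zeros)))   # sigmoid(0.0) == 0.5
print(np.log(1 - h))                              # finite: log(0.5)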

The post has been updated, see the UPDATE below.

Here are my plots:

[Plot: cost function]

[Plot: final separation border]

As you can see, the resulting separation line matches the plot shown in your tutorial very well.

Here is my code. It differs a little from yours, but they are very similar.

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet():
    dataMat = []; labelMat = []
    frX = open('../ex4x.dat')
    frY = open('../ex4y.dat')
    for line1 in frX.readlines():
        lineArr1 = line1.strip().split()
        dataMat.append([1.0, float(lineArr1[0]), float(lineArr1[1])])

    for line2 in frY.readlines():
        lineArr2 = line2.strip().split()
        labelMat.append([float(lineArr2[0])])
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))    

def gradDescent(dataMatIn, classLabels, alpha, maxCycles):
    x = np.mat(dataMatIn)
    y = np.mat(classLabels)
    m,n = np.shape(x)
    n = n - 1               #usually n is the number of features (without the 1's)

    theta = np.zeros((n+1,1))

    cost_history = []       #list to accumulate the cost values

    for k in range(maxCycles):

        h = sigmoid(x*theta)

        cost = ((-np.multiply(y, np.log(h)) -np.multiply(1-y, np.log(1-h))).sum(axis=0)/m)[0, 0]

        if ((k % 1000) == 0):
            cost_history.append(cost)   #on each 1000th iteration the cost is saved to a list

        grad = (x.transpose() * (h - y))/m

        theta = theta - alpha*grad

    plot_cost = 1 
    if (plot_cost == 1):
        plt.plot(cost_history)
        plt.title("Cost")
        plt.show()

    return theta   

def plotBestFit(dataMat, classLabel, weights):
    arrY = np.asarray(classLabel)
    arrX = np.asarray(dataMat)
    ind1 = np.where(arrY == 1)[0]
    ind0 = np.where(arrY == 0)[0]

    min_x1 = min(np.mat(dataMat)[:, 1])
    max_x1 = max(np.mat(dataMat)[:, 1])
    x1_val = np.arange(min_x1, max_x1, 1)
    x2_val = (-weights[0, 0]-weights[1, 0]*x1_val)/weights[2, 0]

    plt.scatter(arrX[ind1, 1], arrX[ind1, 2], s=30, c='red', marker='s')
    plt.scatter(arrX[ind0, 1], arrX[ind0, 2], s=30, c='blue', marker='s')
    plt.plot(x1_val, x2_val)
    plt.xlabel('X1', fontsize=18)
    plt.ylabel('X2', fontsize=18)
    plt.title("Separation border")
    plt.show()


dataMat, classLabel = loadDataSet()
weights = gradDescent(dataMat, classLabel, 0.0014, 500000) 

print(weights)
plotBestFit(dataMat, classLabel, weights)

UPDATE

After reading your questions in the comments to the first edition of the post, I tried to optimize the code to achieve convergence of the cost function with a much smaller number of iterations.

Indeed, feature standardization works miracles :)
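
Concretely, standardization replaces each feature column (except the intercept) by its z-score, computed over the training set. A compact sketch of what the full code below does with tile and divide (assuming dataMat is the (m, 3) numpy array from loadDataSet):

# z-score each feature column; the intercept column stays untouched
dataMat[:, 1:3] = (dataMat[:, 1:3] - dataMat[:, 1:3].mean(axis=0)) / dataMat[:, 1:3].std(axis=0)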

An even better result was achieved after only 30 iterations!

Here are the new plots:

[Plot: cost function after standardization]

[Plot: separation border after standardization]

Because of the standardization, you need to scale each new test example in the same way in order to classify it.
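
For example, a minimal sketch of classifying one new applicant (the scores 55 and 70 are made-up; dataMatMean, dataMatStd, sigmoid and weights come from the code below):

new_x = np.array([1.0, 55.0, 70.0])                              # intercept + raw scores
new_x[1:3] = (new_x[1:3] - dataMatMean[1:3]) / dataMatStd[1:3]   # same scaling as training
prob = sigmoid(np.dot(new_x, weights))[0]                        # probability of class 1
print("admitted" if prob > 0.5 else "not admitted")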

Here is the new code. I changed some data types to avoid unnecessary data type conversions.

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet():
    dataMat = []; labelMat = []
    frX = open('../ex4x.dat')
    frY = open('../ex4y.dat')
    for line1 in frX.readlines():
        lineArr1 = line1.strip().split()
        dataMat.append([1.0, float(lineArr1[0]), float(lineArr1[1])])

    for line2 in frY.readlines():
        lineArr2 = line2.strip().split()
        labelMat.append([float(lineArr2[0])])

    return np.asarray(dataMat), np.asarray(labelMat)

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))    

def gradDescent(x, y, alpha, maxCycles):

    m,n = np.shape(x)
    n = n - 1               #usually n is the number of features (without the 1's)

    theta = np.zeros((n+1,1))

    cost_history = []       #list to accumulate the cost values
    cost_iter = []

    for k in range(maxCycles):

        h = sigmoid(np.dot(x, theta))

        cost = np.sum(-np.multiply(y, np.log(h)) -np.multiply(1-y, np.log(1-h)))/m


        cost_history.append(cost)   # the cost is saved on every iteration
        cost_iter.append(k)

        grad = np.dot(x.transpose(), (h - y))/m

        theta = theta - alpha*grad

    plot_cost = 1 
    if (plot_cost == 1):
        plt.plot(cost_iter, cost_history)
        plt.title("Cost")
        plt.show()

    return theta   

def plotBestFit(arrX, arrY, weights):

    ind1 = np.where(arrY == 1)[0]
    ind0 = np.where(arrY == 0)[0]

    min_x1 = arrX[:, 1].min()
    max_x1 = arrX[:, 1].max()
    x1_val = np.arange(min_x1, max_x1, 0.1)
    x2_val = (-weights[0, 0]-weights[1, 0]*x1_val)/weights[2, 0]

    plt.scatter(arrX[ind1, 1], arrX[ind1, 2], s=30, c='red', marker='s')
    plt.scatter(arrX[ind0, 1], arrX[ind0, 2], s=30, c='blue', marker='s')
    plt.plot(x1_val, x2_val)
    plt.xlabel('X1', fontsize=18)
    plt.ylabel('X2', fontsize=18)
    plt.title("Separation border")
    plt.show()


dataMat, classLabel = loadDataSet()
m = np.shape(dataMat)[0]

#standardization
dataMatMean = np.mean(dataMat, axis=0)
dataMatStd = np.std(dataMat, axis=0)

dataMatMean_m = np.tile(dataMatMean, (m, 1))
dataMatStd_m = np.tile(dataMatStd, (m, 1))

dataMatStand = np.copy(dataMat)
dataMatStand[:, 1:3] = np.divide(  (dataMatStand[:, 1:3] - dataMatMean_m[:, 1:3]),   dataMatStd_m[:, 1:3])

weights = gradDescent(dataMatStand, classLabel, 1.0, 30) 

print(weights)
plotBestFit(dataMatStand, classLabel, weights)
