
Logistic Regression function on sklearn

I am learning Logistic Regression from sklearn and came across this: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

I have created an implementation which shows me the accuracy scores for training and testing. However, it is very unclear how this was achieved. My question is: What is the maximum likelihood estimate? How is this being calculated? What is the error measure? What is the optimisation algorithm used?

I know all of the above in theory, however I am not sure where, when, and how scikit-learn calculates it, or whether it's something I need to implement myself at some point. I have an accuracy rate of 83%, which was what I was aiming for, but I am very confused about how this was achieved by scikit-learn.

Would anyone be able to point me in the right direction?

I recently started studying LR myself; I still don't follow many steps of the derivation, but I think I understand which formulas are being used.

First of all, let's assume that you are using the latest version of scikit-learn and that the solver being used is solver='lbfgs' (which I believe is the default).
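To make the assumption concrete, here is a minimal sketch (with synthetic data, not the asker's dataset) of fitting LogisticRegression with the solver spelled out explicitly; score() returns the mean accuracy that the question refers to:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable labels

# 'lbfgs' is the default solver in recent scikit-learn versions;
# C is the inverse of the regularization strength alpha discussed below.
clf = LogisticRegression(solver='lbfgs', C=1.0, fit_intercept=True)
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data
```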

The code is here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py

What is the maximum likelihood estimate? How is this being calculated?

The function that computes the likelihood estimate is this one: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L57

The interesting line is:

# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)

which is formula 7 of this tutorial. The function also computes the gradient of the likelihood, which is then passed to the minimization function (see below). One important thing is that the intercept corresponds to w0 in the tutorial's formulas, but that is only valid if fit_intercept is True.
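As a hedged illustration, the quoted line can be unpacked into a standalone NumPy function. This is a simplified sketch, not sklearn's actual implementation: it assumes labels encoded as -1/+1, ignores the intercept, and uses a numerically naive log-sigmoid where sklearn's log_logistic is more carefully stabilized:

```python
import numpy as np

def logistic_loss(w, X, y, alpha):
    """Regularized negative log-likelihood, mirroring the quoted sklearn line.

    Assumes y is encoded as -1/+1 and w has no intercept term.
    """
    yz = y * X.dot(w)
    # log(sigmoid(yz)) = -log(1 + exp(-yz)); naive version for clarity only
    log_logistic = -np.log1p(np.exp(-yz))
    # negative log-likelihood plus L2 penalty, as in the sklearn source
    return -np.sum(log_logistic) + 0.5 * alpha * np.dot(w, w)
```

At w = 0 every sample contributes log(2), which is a quick sanity check on the formula.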

What is the error measure?

I'm sorry, I'm not sure.

What is the optimisation algorithm used?

See the following lines in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L389

It's this function: http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html
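To show how the pieces fit together, here is a hedged sketch that feeds a hand-written logistic loss (synthetic data, -1/+1 labels, no intercept) to that same SciPy routine. sklearn passes the analytic gradient; this sketch uses approx_grad=True for brevity, which asks the solver to estimate the gradient by finite differences:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] > 0, 1.0, -1.0)  # labels encoded as +1/-1, as noted below

def loss(w, X, y, alpha):
    # regularized negative log-likelihood, same shape as sklearn's version
    yz = y * X.dot(w)
    return np.sum(np.log1p(np.exp(-yz))) + 0.5 * alpha * np.dot(w, w)

# sklearn supplies the analytic gradient instead of approx_grad=True
w_opt, f_min, info = fmin_l_bfgs_b(loss, np.zeros(2), args=(X, y, 1.0),
                                   approx_grad=True)
```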

One very important thing is that the classes are +1 or -1! (For the binary case; in the literature 0 and 1 are common, but that encoding won't work with these formulas.)

Also notice that numpy broadcasting rules are used in all the formulas. (That's why you don't see any explicit iteration.)
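As a tiny illustration of that point, the sigmoid below is applied to a whole vector of scores in one vectorized expression, with no Python loop over samples:

```python
import numpy as np

# broadcasting: np.exp and the arithmetic apply elementwise to the array
z = np.array([-2.0, 0.0, 2.0])
probs = 1.0 / (1.0 + np.exp(-z))
```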

This was my attempt at understanding the code. I slowly went mad to the point of ripping apart the scikit-learn code (it only works for the binary case). This also served as inspiration.

Hope it helps.

Check out Prof. Andrew Ng's machine learning notes on Logistic Regression (starting from page 16): http://cs229.stanford.edu/notes/cs229-notes1.pdf

In logistic regression you minimize cross entropy (which in turn maximizes the likelihood of y given x). In order to do this, the gradient of the cross entropy (cost) function is computed and used to update the weights assigned to each input. In simple terms, logistic regression comes up with a line that best discriminates your two binary classes by adjusting its parameters so that the cross entropy keeps going down. The 83% accuracy (I'm not sure what accuracy that is; you should be dividing your data into training/validation/testing sets) means the line logistic regression is using for classification can correctly separate the classes 83% of the time.
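The update loop described above can be sketched in a few lines. This is plain batch gradient descent on the mean cross entropy with 0/1 labels (as in Ng's notes), on synthetic data; sklearn uses more sophisticated solvers, but the idea is the same:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] - X[:, 1] > 0).astype(float)  # synthetic 0/1 labels

w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))   # predicted probabilities
    # gradient of the mean cross entropy with respect to w
    w -= 0.1 * X.T.dot(p - y) / len(y)

p = 1.0 / (1.0 + np.exp(-X.dot(w)))
accuracy = np.mean((p > 0.5) == y)        # fraction classified correctly
```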

I would have a look at the following on GitHub:

https://github.com/scikit-learn/scikit-learn/blob/965b109bf2ac3a61dcbd02bc29dd8c9598c2b54c/sklearn/linear_model/logistic.py

The link is to the implementation of sklearn's logistic regression. It contains the optimization algorithms used, which include Newton conjugate gradient (newton-cg) and lbfgs (the Broyden-Fletcher-Goldfarb-Shanno algorithm); newton-cg makes use of second-order (Hessian) information about the loss function (_logistic_loss), while lbfgs builds an approximation to it from gradient evaluations. _logistic_loss is your likelihood function.
