
Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).

I'm looking to run a logistic regression to establish a predictor that would output the probability that some input observation (e.g. a value of the continuous variable described above) is correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.

My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (i.e. 0 or 1). I'm using the following code:

[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');

However, this gives me nonsensical results: a p-value of 1.000, coefficients ( b ) that are extremely large in magnitude (-650.5, 1320.1), and associated standard errors on the order of 1e6.
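For intuition on just how extreme those coefficients are, here is a quick NumPy sketch (Python rather than MATLAB, purely to illustrate the arithmetic) evaluating the logistic curve with b = (-650.5, 1320.1): the fitted curve is effectively a step function at x ≈ 0.49.

```python
import numpy as np

def logistic(x, b0, b1):
    """Inverse logit: P(y = 1 | x) under a fitted logistic model."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

b0, b1 = -650.5, 1320.1          # the coefficients reported by glmfit
xs = np.array([0.4, 0.49, 0.5, 0.6])
probs = logistic(xs, b0, b1)
# Just below the crossover point (-b0/b1 ~ 0.493) the probability is
# numerically ~0, and just above it ~1: a near-vertical step, not an S-curve.
print(probs)
```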

I then tried using an additional parameter to specify the size of my binomial sample:

glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));

This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates ( Y_fit = glmval(b,[0:0.01:1],'logit'); ), and created an array for the fitting ( X_fit = linspace(0,1) ). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-') , the resulting model curve essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.

My questions are as follows:

1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to accept the stats output from glmfit , but my use of glmfit is not giving correct results.

Any comments and input would be very useful, thanks!

UPDATE (3/18/14)

I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.

I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal) .
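For reference, with two categories mnrfit models log(P(category 1)/P(category 2)) = b0 + b1*x (MATLAB treats the last category as the reference, so category 1 here corresponds to the original Y = 0 labels), and pihat is just the inverse logit of that. A NumPy sketch of the mapping, using made-up coefficients for illustration only:

```python
import numpy as np

def pihat(x, b0, b1):
    """Category probabilities from two-category mnrfit-style coefficients,
    where log(p1 / p2) = b0 + b1 * x."""
    p1 = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return np.column_stack([p1, 1.0 - p1])   # mnrval-like: one column per category

# Hypothetical coefficients, not the actual fitted values.
b0, b1 = 2.0, -4.0
loopVal = np.linspace(0, 1, 5)
probs = pihat(loopVal, b0, b1)
print(probs)                                  # each row sums to 1
```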

The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001 , and the coefficient estimates were quite different as well.

Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only compared to dev values from other models?
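For binary 0/1 responses that quoted definition has a concrete form: the saturated model (one parameter per observation) fits every point exactly and has log-likelihood 0, so the deviance reduces to -2 times the model's log-likelihood. A NumPy sketch (Python rather than MATLAB, with made-up toy data) of how it behaves:

```python
import numpy as np

def deviance(X, Y, b0, b1):
    """Binomial deviance for 0/1 responses: -2 * model log-likelihood,
    since the saturated model's log-likelihood is exactly 0 here."""
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * X)))
    return -2.0 * np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

# Toy data split at x = 0.5; a model near the true split has far lower
# deviance than the "coin flip" model (b0 = b1 = 0, every p = 0.5).
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(deviance(X, Y, -5.0, 10.0), deviance(X, Y, 0.0, 0.0))
```

On its own the deviance has no absolute scale; it is mainly useful for comparing nested models, where the difference in deviances is approximately chi-square distributed.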

It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0 ) and all values of x > xDiv belong to the other class ( y = 1 ).

If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.

This is bad news for logistic regression (LR), as LR isn't really meant to deal with problems where the data are linearly separable.
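A quick way to check whether one-dimensional data are separable in this sense is to test whether every class-0 x lies entirely below (or entirely above) every class-1 x. A small NumPy sketch (Python for illustration, with toy data):

```python
import numpy as np

def linearly_separable_1d(X, Y):
    """True if some threshold xDiv splits the two classes perfectly."""
    x0, x1 = X[Y == 0], X[Y == 1]
    return x0.max() < x1.min() or x1.max() < x0.min()

# Toy data: all class-0 points below 0.5, all class-1 points above it.
X = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
Y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ])
print(linearly_separable_1d(X, Y))   # True
```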

Logistic regression is trying to fit a function of the following form:

y = 1 / (1 + exp(-(b0 + b1*x)))

This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator is at negative infinity or infinity.

Now, because your data is linearly separable and MATLAB's LR function attempts to find a maximum-likelihood fit for the data, you will get extreme weight values.

This isn't necessarily a solution, but try flipping the label on just one of your data points (i.e. for some index t where y(t) == 0 , set y(t) = 1 ). This will cause your data to no longer be linearly separable, and the learned weight values will be dragged dramatically closer to zero.
