
Logistic regression without an intercept gives fitting warning message

I am trying to run a logistic regression without an intercept. First, I tried the function glm, but I got the following warning:

    Warning message:        
    glm.fit: fitted probabilities numerically 0 or 1 occurred       
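
For reference, a minimal sketch of the kind of call that produces this warning; the variable names are assumed from the bayesglm call shown further down, and the trailing -1 is what drops the intercept:

    # logistic regression without an intercept; the trailing -1 removes it
    # (variable names assumed from the bayesglm call below)
    regress_glm <- glm(y ~ x1 * x2 + x3 + x4 - 1,
                       data = DATA,
                       family = binomial(link = "logit"))
    summary(regress_glm)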

Since it was not possible to change the data set at all given the nature of my work, I decided to use a different R package that provides the function bayesglm.

When I use this function with the intercept included, I do not get the warning above. However, when I exclude the intercept by adding -1 at the end of the formula, I still get the same warning, with the following output:

    > regress=bayesglm(y~x1*x2+x3+x4-1, data = DATA, family=binomial(link="logit"))     
    > summary(regress)      

    Call:       
    bayesglm(formula = y ~ x1 * x2 + x3 + x4 - 1, family = binomial(link = "logit"),        
        data = DATA, maxit = 10000)     

    Deviance Residuals:         
         Min        1Q    Median        3Q       Max        
    -1.01451  -0.43143  -0.22778  -0.05431   2.89066        

    Coefficients:       
             Estimate Std. Error z value Pr(>|z|)           
    x1      -20.45537    9.70594  -2.108  0.03507 *         
    x2       -7.04844    2.87415  -2.452  0.01419 *         
    x1:x2     0.13409   17.57010   0.008  0.99391           
    x3       -0.17779    0.06377  -2.788  0.00531 **        
    x4       -0.02593    0.05313  -0.488  0.62548           
    ---     
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1      

    (Dispersion parameter for binomial family taken to be 1)        

        Null deviance: 494.91  on 357  degrees of freedom       
    Residual deviance: 124.93  on 352  degrees of freedom       
      (165 observations deleted due to missingness)     
    AIC: 134.93     

    Number of Fisher Scoring iterations: 123        

and I get the same warning as below:

    Warning message:        
    glm.fit: fitted probabilities numerically 0 or 1 occurred       

which I do not get if I do not add -1 to remove the intercept.

Therefore, I have two questions:

1. Is it possible for me to ignore this warning message?

2. If not, how can I fix the problem indicated by this warning message?

The correct answer to this question is that the intercept should not be removed in a logistic regression. Fixing the warning message without fixing the mis-specification of the model is not appropriate practice.

In a properly specified logistic regression, this warning can show up when there is perfect separation (combinations of predictors that completely explain class membership in the data sample at hand), and there are well-established ways to deal with this phenomenon, as explained for example on this page.
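
One widely used remedy for perfect separation is Firth's penalized-likelihood (bias-reduced) logistic regression; a minimal sketch, assuming the logistf package is installed and using the variable names from the question (note that the intercept is kept):

    # Firth's bias-reduced logistic regression handles perfect separation
    # install.packages("logistf")  # if not already installed
    library(logistf)
    firth_fit <- logistf(y ~ x1 * x2 + x3 + x4, data = DATA)
    summary(firth_fit)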

It is, however, inappropriate to remove the intercept in a logistic regression model. See this page and the extensive discussion in comments on the duplicate posting of this question on Cross Validated, in particular https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression, which includes many suggestions.

I will try to provide an answer to the question.

What does the warning mean? The warning is given when numerical precision might be in question for certain observations. More precisely, it is given when the fitted model returns a probability of 1 - epsilon or, equivalently, 0 + epsilon. By default these bounds are 1 - 10^-8 and 10^-8 respectively (as given by glm.control) for the standard glm.fit function.
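
To see which observations trigger the warning, one can inspect the fitted probabilities directly; a minimal sketch, using the regress object from the question and the 10^-8 bound quoted above:

    # fitted probabilities from the model in the question
    p_hat <- fitted(regress)
    # indices of observations whose fitted probability is numerically 0 or 1
    eps <- 1e-8
    which(p_hat > 1 - eps | p_hat < eps)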

When may this happen? In my experience, the case where this happens most often is when factors (or dummy variables) are included for which only one outcome is observed within some category. This occurs most often when interactions between factors with many levels are included and the data for the analysis are limited. Similarly, if you have many variables compared to the number of observations (counting used variables, interactions, transformations etc. as individual variables, so the total is the sum of all of these), a similar picture is possible. In your case, if you have factors, removing the intercept adds a level to the factor coding, which might reduce precision near the probability edge cases of 0 and 1. In short, if for some part of the data we have no (or little) uncertainty, this warning gives us an indication.
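
One way to spot such categories is to cross-tabulate the outcome against each factor; a minimal sketch, using a hypothetical factor column f in DATA (a zero cell means only one outcome was observed in that category):

    # cross-tabulate the outcome against a (hypothetical) factor column f;
    # a zero cell indicates a category with only one observed outcome
    with(DATA, table(f, y))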

Can I ignore it, and otherwise how can I fix it? This depends on the problem at hand and on its scale. Several sources, like John Fox, would likely consider these observations possible outliers, and argue for removing them after using influence measures (available in the car package for base glm) or performing some outlier tests (also available in the car package for base glm), if this is an option within your field of work. If these show that the observations do not influence the fit, you would not remove them, as there would be no statistical argument for doing so.
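
A minimal sketch of such diagnostics, assuming the car package is installed and applied to the regress object from the question:

    library(car)
    # Bonferroni outlier test on the studentized residuals
    outlierTest(regress)
    # studentized residuals vs hat-values, point size proportional to Cook's distance
    influencePlot(regress)
    # base R influence measures as an alternative
    summary(influence.measures(regress))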

If outlier removal is not an option in your field of work, then a reduced model (fewer variables in general) might help if that is the cause, or, if the number of factor levels is the cause, merging levels within factors might give better results, as sketched below.
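
Merging levels can be done in base R; a minimal sketch, using a hypothetical factor f whose sparse levels "b" and "c" are collapsed into one:

    # collapse the sparse levels "b" and "c" of a hypothetical factor f into one level "bc"
    DATA$f_merged <- factor(DATA$f)
    levels(DATA$f_merged)[levels(DATA$f_merged) %in% c("b", "c")] <- "bc"
    table(DATA$f_merged, DATA$y)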

Other sources might have other suggestions, but John Fox is a credible source on the subject for these model types. It becomes a question of 'Is my model correctly specified?', 'How severely does it affect my model?' and 'How much are you allowed to do in your line of work?', while following the general theory and guidelines within statistics. Probabilities close to 0 and 1 are less likely to be precise and more likely to be due to numerical imprecision, but if these are not the cases you need to predict, and there is no significant effect on the remainder of the model, this is not necessarily a problem and may be ignored.
