
Regression for percentages: different results in R, Python and Matlab

I have percentages and need to fit a regression. According to basic statistics, logistic regression is preferable to OLS here, because percentages violate the OLS requirement of a continuous and unconstrained value space.

So far, so good. However, I get different results in R, Python, and Matlab. In fact, Matlab even reports significant values where Python does not.

My models look like this:

R:
summary(glm(foo ~ 1 + bar + baz  , family = "binomial", data = <<data>>))

Python via statsmodels:
smf.logit('foo ~ 1 + bar + baz', <<data>>).fit().summary()

Matlab:
fitglm(<<data>>,'foo ~ 1 + bar + baz','Link','logit')

where Matlab currently produces the best results.

Could there be different initialization values? Different solvers? Different settings for alphas when computing p-values? How can I get results that at least fall in similar numeric ranges, or that flag the same features as significant? I do not require exactly equal numeric output.

Edit

The summary statistics:

Python:
Dep. Variable:  foo                 No. Observations:  104
Model:          Logit               Df Residuals:      98
Method:         MLE                 Df Model:          5
Date:           Wed, 28 Aug 2019    Pseudo R-squ.:     inf
Time:           06:48:12            Log-Likelihood:    -0.25057
converged:      True                LL-Null:           0.0000
                                    LLR p-value:       1.000

                coef   std err        z    P>|z|     [0.025    0.975]
Intercept   -16.9863   154.602   -0.110    0.913   -320.001   286.028
bar          -0.0278     0.945   -0.029    0.977     -1.880     1.824
baz          18.5550   280.722    0.066    0.947   -531.650   568.760
a             9.9996   153.668    0.065    0.948   -291.184   311.183
b             0.6757   132.542    0.005    0.996   -259.102   260.454
d             0.0005     0.039    0.011    0.991     -0.076     0.077


R:
glm(formula = myformula, family = "binomial", data = r_x)

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.046466  -0.013282  -0.001017   0.006217   0.104467  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.699e+01  1.546e+02  -0.110    0.913
bar         -2.777e-02  9.449e-01  -0.029    0.977
baz          1.855e+01  2.807e+02   0.066    0.947
a            1.000e+01  1.537e+02   0.065    0.948
b            6.757e-01  1.325e+02   0.005    0.996
d            4.507e-04  3.921e-02   0.011    0.991

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.049633  on 103  degrees of freedom
Residual deviance: 0.035684  on  98  degrees of freedom
AIC: 12.486

Matlab:
Estimated Coefficients:
                   Estimate         SE         tStat        pValue
                   _________    __________    ________    __________

    (Intercept)      -21.044         3.315     -6.3483    6.8027e-09
    bar            -0.033507      0.022165     -1.5117       0.13383
    d              0.0016149    0.00083173      1.9416      0.055053
    baz               21.427        6.0132      3.5632    0.00056774
    a                 14.875        3.7828      3.9322    0.00015712
    b                -1.2126        2.7535    -0.44038       0.66063


104 observations, 98 error degrees of freedom
Estimated Dispersion: 1.25e-06
F-statistic vs. constant model: 7.4, p-value = 6.37e-06

You are not actually using the binomial distribution in the MATLAB case. You specify the link function, but the distribution stays at its default, the normal distribution, which will not give you the expected logistic fit, at least if the sample sizes behind the percentages are small. It also gives you lower p-values, because the normal distribution is less constrained in its variance than the binomial distribution: the normal fit estimates the dispersion from the residuals (the MATLAB output above reports 1.25e-06), whereas the binomial family fixes it at 1, as the R output notes.

You need to set the Distribution argument to binomial:

fitglm(<<data>>, 'foo ~ 1 + bar + baz', 'Distribution', 'binomial', 'Link', 'logit')

The R and Python results seem to match rather well.
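
To see why a normal distribution with a logit link can report much smaller standard errors than a binomial fit of the same data, here is a minimal pure-Python sketch. It is not the implementation used by MATLAB, R or statsmodels; the helper name `fit_logit_link`, the synthetic proportions and the "true" coefficients (0.3 and 0.5) are all assumptions made for illustration. It fits the same data twice by iteratively reweighted least squares: once with the binomial variance and dispersion fixed at 1, and once with constant variance and dispersion estimated from the residuals.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logit_link(x, y, family, n_iter=50):
    """Fit y ~ 1 + x with a logit link by IRLS (hypothetical helper).

    family="binomial": variance mu*(1-mu), dispersion fixed at 1.
    family="normal":   constant variance, dispersion estimated from residuals.
    Returns ((b0, b1), (se0, se1), dispersion).
    """
    def moments(eta):
        mu = [sigmoid(e) for e in eta]
        d = [m * (1.0 - m) for m in mu]  # d(mu)/d(eta) for the logit link
        # IRLS weight = (dmu/deta)^2 / Var(mu)
        w = d if family == "binomial" else [di * di for di in d]
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        return mu, d, w, s0, s1, s2

    # Start from y shrunk towards 0.5 so that the logit is finite.
    eta = [math.log(m / (1.0 - m)) for m in ((yi + 0.5) / 2.0 for yi in y)]
    for _ in range(n_iter):
        mu, d, w, s0, s1, s2 = moments(eta)
        # Working response, then weighted least squares of z on [1, x].
        z = [e + (yi - m) / di for e, yi, m, di in zip(eta, y, mu, d)]
        t0 = sum(wi * zi for wi, zi in zip(w, z))
        t1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s0 * s2 - s1 * s1
        b0 = (s2 * t0 - s1 * t1) / det
        b1 = (s0 * t1 - s1 * t0) / det
        eta = [b0 + b1 * xi for xi in x]

    # Covariance = dispersion * inverse(X' W X), evaluated at the final fit.
    mu, d, w, s0, s1, s2 = moments(eta)
    det = s0 * s2 - s1 * s1
    if family == "binomial":
        phi = 1.0
    else:
        phi = sum((yi - m) ** 2 for yi, m in zip(y, mu)) / (len(y) - 2)
    se = (math.sqrt(phi * s2 / det), math.sqrt(phi * s0 / det))
    return (b0, b1), se, phi


# Synthetic proportions lying very close to a logistic curve (assumed data).
x = [-1.0 + 0.2 * i for i in range(11)]
y = [sigmoid(0.3 + 0.5 * xi) + (0.002 if i % 2 == 0 else -0.002)
     for i, xi in enumerate(x)]

coef_bin, se_bin, phi_bin = fit_logit_link(x, y, "binomial")
coef_norm, se_norm, phi_norm = fit_logit_link(x, y, "normal")

for name, coef, se, phi in (("binomial", coef_bin, se_bin, phi_bin),
                            ("normal", coef_norm, se_norm, phi_norm)):
    print(f"{name:8s} slope={coef[1]:.4f} se={se[1]:.4f} "
          f"stat={coef[1] / se[1]:.2f} dispersion={phi:.3g}")
```

Both fits recover nearly the same slope, but the normal-family fit scales its covariance by a very small estimated dispersion, so its test statistic is far larger. With only raw proportions and no trial counts, the binomial fit treats each row as a single trial, which is why its standard errors are so wide.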
