Regression for percentages - different results in R, Python and MATLAB
I have percentages and need to fit a regression. According to basic statistics, logistic regression is a better choice than OLS here, because percentages violate the OLS requirement of a continuous, unconstrained value space.
So far, so good. However, I get different results in R, Python, and MATLAB. In fact, MATLAB even reports significant values where Python does not.
My models look like:
R:
summary(glm(foo ~ 1 + bar + baz , family = "binomial", data = <<data>>))
Python via statsmodels:
smf.logit('foo ~ 1 + bar + baz', <<data>>).fit().summary()
Matlab:
fitglm(<<data>>,'foo ~ 1 + bar + baz','Link','logit')
where MATLAB currently produces the best results.
Could there be different initialization values? Different solvers? Different settings for alphas when computing p-values?
How can I get the same results, at least in similar numeric ranges or with the same features detected as significant? I do not require exactly equal numeric output.
The summary statistics:
Python:
Dep. Variable:    foo                 No. Observations:    104
Model:            Logit               Df Residuals:        98
Method:           MLE                 Df Model:            5
Date:             Wed, 28 Aug 2019    Pseudo R-squ.:       inf
Time:             06:48:12            Log-Likelihood:      -0.25057
converged:        True                LL-Null:             0.0000
                                      LLR p-value:         1.000

                coef    std err        z    P>|z|     [0.025     0.975]
Intercept   -16.9863    154.602   -0.110    0.913   -320.001    286.028
bar          -0.0278      0.945   -0.029    0.977     -1.880      1.824
baz          18.5550    280.722    0.066    0.947   -531.650    568.760
a             9.9996    153.668    0.065    0.948   -291.184    311.183
b             0.6757    132.542    0.005    0.996   -259.102    260.454
d             0.0005      0.039    0.011    0.991     -0.076      0.077
R:
glm(formula = myformula, family = "binomial", data = r_x)
Deviance Residuals:
      Min         1Q     Median         3Q        Max
-0.046466  -0.013282  -0.001017   0.006217   0.104467

Coefficients:
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -1.699e+01   1.546e+02   -0.110     0.913
bar          -2.777e-02   9.449e-01   -0.029     0.977
baz           1.855e+01   2.807e+02    0.066     0.947
a             1.000e+01   1.537e+02    0.065     0.948
b             6.757e-01   1.325e+02    0.005     0.996
d             4.507e-04   3.921e-02    0.011     0.991

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.049633  on 103  degrees of freedom
Residual deviance: 0.035684  on  98  degrees of freedom
AIC: 12.486
Matlab:
Estimated Coefficients:
                    Estimate          SE         tStat        pValue
                   _________    __________    ________    __________
    (Intercept)      -21.044         3.315     -6.3483    6.8027e-09
    bar            -0.033507      0.022165     -1.5117       0.13383
    d              0.0016149    0.00083173      1.9416      0.055053
    baz               21.427        6.0132      3.5632    0.00056774
    a                 14.875        3.7828      3.9322    0.00015712
    b                -1.2126        2.7535    -0.44038       0.66063

104 observations, 98 error degrees of freedom
Estimated Dispersion: 1.25e-06
F-statistic vs. constant model: 7.4, p-value = 6.37e-06
You are not actually using the binomial distribution in the MATLAB case. You are specifying the link function, but the distribution stays at its default, the normal distribution, which will not give you the expected logistic fit, at least if the sample sizes behind the percentages are small. It also gives you lower p-values, because the variance of the normal distribution is not tied to the mean the way the binomial variance is. It is estimated from the residuals instead, which is why MATLAB reports an estimated dispersion of 1.25e-06 while R takes the dispersion to be 1.
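To illustrate the point about the variance, here is a minimal pure-Python sketch. It is not what MATLAB, R, or statsmodels run internally. It is a hand-rolled IRLS for a single predictor with a logit link, with made-up data and hypothetical function names (sigmoid, weighted_sums, fit_logit_glm). The same fit is done twice: once with the binomial variance mu(1-mu) and dispersion fixed at 1, and once with a constant variance whose dispersion is estimated from the residuals.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def weighted_sums(x, y, b0, b1, family):
    """Accumulate the 2x2 weighted least-squares system for one IRLS step."""
    s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        mu = sigmoid(eta)
        g = mu * (1.0 - mu)                        # d(mu)/d(eta) for the logit link
        var = g if family == "binomial" else 1.0   # variance function V(mu)
        w = g * g / var                            # IRLS weight
        z = eta + (yi - mu) / g                    # working response
        s_w += w
        s_wx += w * xi
        s_wxx += w * xi * xi
        s_wz += w * z
        s_wxz += w * xi * z
    return s_w, s_wx, s_wxx, s_wz, s_wxz

def fit_logit_glm(x, y, family, n_iter=50):
    """Fit y ~ 1 + x with a logit link; returns ((b0, b1), (se0, se1)).

    family is "binomial" (dispersion fixed at 1) or "normal"
    (dispersion estimated as RSS / (n - 2)). Fixed iteration count,
    no convergence check: this is a sketch, not a production solver.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        s_w, s_wx, s_wxx, s_wz, s_wxz = weighted_sums(x, y, b0, b1, family)
        det = s_w * s_wxx - s_wx ** 2
        b0, b1 = ((s_wxx * s_wz - s_wx * s_wxz) / det,
                  (s_w * s_wxz - s_wx * s_wz) / det)
    # Covariance = dispersion * inverse of X'WX at the final estimates.
    s_w, s_wx, s_wxx, _, _ = weighted_sums(x, y, b0, b1, family)
    det = s_w * s_wxx - s_wx ** 2
    if family == "binomial":
        phi = 1.0
    else:
        rss = sum((yi - sigmoid(b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
        phi = rss / (len(x) - 2)
    return (b0, b1), (math.sqrt(phi * s_wxx / det), math.sqrt(phi * s_w / det))

# Made-up proportions that lie very close to a logistic curve.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
noise = [0.01, -0.01, 0.005, -0.005, 0.01, -0.01, 0.005]
y = [sigmoid(0.2 + 0.5 * xi) + e for xi, e in zip(x, noise)]

coef_b, se_b = fit_logit_glm(x, y, "binomial")
coef_n, se_n = fit_logit_glm(x, y, "normal")
print("binomial: slope", coef_b[1], "std err", se_b[1])
print("normal:   slope", coef_n[1], "std err", se_n[1])
```

Both fits should land on similar coefficients, but the normal fit reports a much smaller standard error because its dispersion comes from the tiny residuals, while the binomial fit treats each proportion as if it came from a single trial.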
You need to set the Distribution argument to 'binomial':
fitglm(<<data>>, 'foo ~ 1 + bar + baz', 'Distribution', 'binomial', 'Link', 'logit')
The R and Python results match rather well.
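As a quick check on the posted tables, the sketch below recomputes the test statistic for baz from the estimates and standard errors shown in the question. The helper name wald_p_value is hypothetical. The p-value uses the normal approximation that matches the z column in the R and Python output. MATLAB labels its statistic tStat, so its p-value presumably comes from a t distribution, which the Python standard library does not provide, and only the statistic itself is recomputed for it.

```python
import math

def wald_p_value(estimate, std_err):
    """Return (z, two-sided p-value) under a standard normal reference."""
    z = estimate / std_err
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# Values for baz copied from the Python (statsmodels) and MATLAB tables above.
z_py, p_py = wald_p_value(18.5550, 280.722)
t_matlab = 21.427 / 6.0132
se_ratio = 280.722 / 6.0132   # how much larger the binomial standard error is

print("statsmodels baz: z =", round(z_py, 3), "p =", round(p_py, 3))
print("MATLAB baz: tStat =", round(t_matlab, 4))
print("std err ratio (binomial / normal):", round(se_ratio, 1))
```

The estimates for baz are of similar size in both fits (18.6 versus 21.4). The difference in significance comes almost entirely from the standard errors.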