![](/img/trans.png)
[英]Why can't statsmodels reproduce my R logistic regression results?
[英]Why are the logistic regression results different between statsmodels and R?
我正在嘗試比較 python 的 statsmodels 和 R 中的邏輯回歸實現。
蟒蛇版本:
import statsmodels.api as sm
import pandas as pd
import pylab as pl
import numpy as np
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
df.columns = list(df.columns)[:3] + ["prestige"]
# df.hist()
# pl.show()
dummy_ranks = pd.get_dummies(df["prestige"], prefix="prestige")
cols_to_keep = ["admit", "gre", "gpa"]
data = df[cols_to_keep].join(dummy_ranks.ix[:, "prestige_2":])
data["intercept"] = 1.0
train_cols = data.columns[1:]
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit()
result.summary2()
結果:
Results: Logit
=================================================================
Model: Logit Pseudo R-squared: 0.083
Dependent Variable: admit AIC: 470.5175
Date: 2014-12-19 01:11 BIC: 494.4663
No. Observations: 400 Log-Likelihood: -229.26
Df Model: 5 LL-Null: -249.99
Df Residuals: 394 LLR p-value: 7.5782e-08
Converged: 1.0000 Scale: 1.0000
No. Iterations: 6.0000
------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
------------------------------------------------------------------
gre 0.0023 0.0011 2.0699 0.0385 0.0001 0.0044
gpa 0.8040 0.3318 2.4231 0.0154 0.1537 1.4544
prestige_2 -0.6754 0.3165 -2.1342 0.0328 -1.2958 -0.0551
prestige_3 -1.3402 0.3453 -3.8812 0.0001 -2.0170 -0.6634
prestige_4 -1.5515 0.4178 -3.7131 0.0002 -2.3704 -0.7325
intercept -3.9900 1.1400 -3.5001 0.0005 -6.2242 -1.7557
=================================================================
R版:
data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
require(reshape2)
data1 = dcast(data, admit + gre + gpa ~ rank)
require(dplyr)
names(data1)[4:7] = paste("rank", 1:4, sep="")
data1 = data1[, -4]
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1))
結果:
Call:
glm(formula = admit ~ gre + gpa + rank2 + rank3 + rank4, family = binomial,
data = data1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5133 -0.8661 -0.6573 1.1808 2.0629
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.184029 1.162421 -3.599 0.000319 ***
gre 0.002358 0.001112 2.121 0.033954 *
gpa 0.770591 0.343908 2.241 0.025046 *
rank2 -0.369711 0.310342 -1.191 0.233535
rank3 -1.015012 0.335147 -3.029 0.002457 **
rank4 -1.249251 0.414416 -3.014 0.002574 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 466.13 on 377 degrees of freedom
Residual deviance: 434.12 on 372 degrees of freedom
AIC: 446.12
Number of Fisher Scoring iterations: 4
結果完全不同,例如,rank_2 的 p 值分別為 0.03 和 0.2。 我想知道造成這種差異的原因是什么? 請注意,我為兩個版本都創建了虛擬變量,並為 python 版本創建了一個常量列,它在 R 中自動處理。
此外,python 似乎快了 2 倍:
##################################################
# python timing
def test():
for i in range(5000):
logit = sm.Logit(data["admit"], data[train_cols])
result = logit.fit(disp=0)
import time
start = time.time()
test()
print(time.time() - start)
10.099738836288452
##################################################
# R timing
> f = function() for(i in 1:5000) {mod = glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data1)}
> system.time(f())
user system elapsed
17.505 0.021 17.526
不確定您的數據操作的意圖,但它們似乎在 R 運行中丟失了信息。 如果我保留所有排名信息,那么我會在原始數據對象上得到它(並且結果在它們重疊的區域中看起來非常相似。(可能性僅估計為任意常數,因此您只能比較對數似然。即使有這個警告,偏差也應該是負對數似然的兩倍,因此這些結果也具有可比性。)
> summary(glm(admit ~ gre + gpa +as.factor( rank), family=binomial,
data=data)) # notice that I'm using your original data-object
Call:
glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial,
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
as.factor(rank)2 -0.675443 0.316490 -2.134 0.032829 *
as.factor(rank)3 -1.340204 0.345306 -3.881 0.000104 ***
as.factor(rank)4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
我像這樣重做了 R 部分:
makeDummy = function(x, x1) { ifelse(is.na(x), NA, ifelse(x == x1, 1, 0)) }
data = read.csv("http://www.ats.ucla.edu/stat/data/binary.csv", head=T)
data$rank2 = makeDummy(data$rank, 2)
data$rank3 = makeDummy(data$rank, 3)
data$rank4 = makeDummy(data$rank, 4)
summary(glm(admit ~ gre + gpa + rank2 + rank3 + rank4, family=binomial, data=data))
結果和上面的完全一樣
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
我想要么我以錯誤的方式使用dplyr::dcast
,要么dcast
。
我只能回答而不是對已接受的答案添加評論。 在 python 中,您通常必須刪除一個虛擬類以使其成為參考類,但我認為您不需要為 R 這樣做,因為 glm 會為您設置參考類。 基本上,如果我正確理解您的代碼,您就不需要這一行...
data1 = data1[, -4]
嘗試只是按原樣放置聲望,但首先使用 as.factor() 設置為一個因子。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.