Logistic regression results different in Scikit python and R?
I ran a logistic regression on the iris dataset in both R and Python (scikit-learn), but the two give different results (coefficients, intercept, and scores).
#Python codes.
In[23]: iris_df.head(5)
Out[23]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
In[35]: iris_df.shape
Out[35]: (100, 5)
#looking at the levels of the Species dependent variable..
In[25]: iris_df['Species'].unique()
Out[25]: array([0, 1], dtype=int64)
#creating dependent and independent variable datasets..
x = iris_df.iloc[:, 0:4]   # .ix is deprecated; .iloc selects columns by position
y = iris_df.iloc[:, -1]
#modelling starts..
y = np.ravel(y)
logistic = LogisticRegression()
model = logistic.fit(x, y)
#getting the model coefficients..
model_coef = pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
model_intercept = model.intercept_
In[30]: model_coef
Out[36]:
0 1
0 Sepal.Length [-0.402473917528]
1 Sepal.Width [-1.46382924771]
2 Petal.Length [2.23785647964]
3 Petal.Width [1.0000929404]
In[31]: model_intercept
Out[31]: array([-0.25906453])
#scores...
In[34]: logistic.predict_proba(x)
Out[34]:
array([[ 0.9837306 , 0.0162694 ],
[ 0.96407227, 0.03592773],
[ 0.97647105, 0.02352895],
[ 0.95654126, 0.04345874],
[ 0.98534488, 0.01465512],
[ 0.98086592, 0.01913408],
#R codes..
> str(irisdf)
'data.frame': 100 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : int 0 0 0 0 0 0 0 0 0 0 ...
> model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.681e-05 -2.110e-08 0.000e+00 2.110e-08 2.006e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.556 601950.324 0 1
Sepal.Length -9.879 194223.245 0 1
Sepal.Width -7.418 92924.451 0 1
Petal.Length 19.054 144515.981 0 1
Petal.Width 25.033 216058.936 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 1.3166e-09 on 95 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Because of the convergence problem, I increased the maximum number of iterations and relaxed the epsilon convergence tolerance.
> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)
Call:
glm(formula = Species ~ ., family = binomial, data = irisdf,
control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.0102793 -0.0005659 -0.0000052 0.0001438 0.0112531
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.796 704.352 0.003 0.998
Sepal.Length -3.426 215.912 -0.016 0.987
Sepal.Width -4.208 123.513 -0.034 0.973
Petal.Length 7.615 159.478 0.048 0.962
Petal.Width 11.835 285.938 0.041 0.967
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.3863e+02 on 99 degrees of freedom
Residual deviance: 5.3910e-04 on 95 degrees of freedom
AIC: 10.001
Number of Fisher Scoring iterations: 12
#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
1 2 3 4 5
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08
The scores, intercept, and coefficients are all completely different between R and Python. Which one is correct? I would like to continue with Python, but now I am confused about whether the results are accurate.
UPDATE: The problem is that there is perfect separation along the Petal.Width variable. In other words, this variable can be used to perfectly predict whether a sample in the given dataset is setosa or versicolor. This breaks the maximum-likelihood estimation used by logistic regression in R: the log-likelihood can be driven arbitrarily high by sending the Petal.Width coefficient toward infinity.
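A quick way to see the separation (a sketch using scikit-learn's bundled copy of the iris data, whose first two classes match the 100-row setosa/versicolor subset in the question):

```python
from sklearn.datasets import load_iris

# scikit-learn's bundled iris data; targets 0/1 are setosa/versicolor,
# matching the 100-row dataset in the question.
data = load_iris(as_frame=True)
df = data.frame[data.frame["target"] < 2]

# Perfect separation: every setosa Petal.Width lies below every
# versicolor Petal.Width, so a single threshold on this one column
# predicts the class with 100% accuracy.
setosa_max = df.loc[df["target"] == 0, "petal width (cm)"].max()
versicolor_min = df.loc[df["target"] == 1, "petal width (cm)"].min()
print(setosa_max, versicolor_min)  # setosa max is strictly below versicolor min
```

Any cutoff between those two values classifies the subset without error, which is exactly the situation that makes the unpenalized likelihood unbounded.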
Some background and strategies are discussed here.
There is also a good thread on CrossValidated discussing strategies.
So why does sklearn's LogisticRegression work? Because it performs "regularized logistic regression": the regularization penalty keeps the estimated parameters from growing without bound.
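You can watch the regularization at work by weakening it. In sklearn, `C` is the inverse regularization strength; as `C` grows, the fit approaches R's unpenalized glm and the separated coefficient starts to diverge (a sketch on the same two iris classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# First two iris classes (setosa/versicolor), as in the question.
X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]

# As C grows the L2 penalty vanishes and, because the classes are
# perfectly separable, the Petal.Width coefficient is free to grow
# without bound -- the same divergence R's unpenalized glm() runs into.
petal_width_coef = {}
for C in (1.0, 1e3, 1e6):
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    petal_width_coef[C] = model.coef_[0][3]
    print(f"C={C:g}  Petal.Width coefficient = {petal_width_coef[C]:.2f}")
```

With the default `C=1.0` the coefficient stays modest (about 1, as in the question's output); at `C=1e6` it is much larger, mirroring R's runaway estimates.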
In the example below I use logistf, which implements Firth's bias-reduced logistic regression, to produce a model that converges.
library(logistf)
irisdf = read.table("path_to _iris.txt", sep="\t", header=TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)
model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.
model2 <- logistf(Species ~ ., data = irisdf)
# Does converge. (logistf always fits a binomial model, so no family argument is needed.)
ORIGINAL: Based on the std. errors and z values in your R solution, I think your model specification is bad. z values close to 0 essentially tell you there is no relationship between the predictors and the dependent variable, so this would be a nonsensical model. My first thought was that you need to convert the Species field to a categorical variable: it is of type int in your example. Try using as.factor.