Logistic回归结果在Scikit python和R中有所不同？

Question

我在R和Python上对虹膜数据集运行逻辑回归。但两者都给出了不同的结果（系数，截距和分数）。

#Python codes.
    In[23]: iris_df.head(5)
    Out[23]: 
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    0           5.1          3.5           1.4          0.2        0
    1           4.9          3.0           1.4          0.2        0
    2           4.7          3.2           1.3          0.2        0
    3           4.6          3.1           1.5          0.2        0
    In[35]: iris_df.shape
    Out[35]: (100, 5)
    #looking at the levels of the Species dependent variable..

        In[25]: iris_df['Species'].unique()
        Out[25]: array([0, 1], dtype=int64)

    #creating dependent and independent variable datasets..

        x = iris_df.ix[:,0:4]
        y = iris_df.ix[:,-1]

    #modelling starts..
    y = np.ravel(y)
    logistic = LogisticRegression()
    model = logistic.fit(x,y)
    #getting the model coefficients..
    model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
    model_intercept = model.intercept_
    In[30]: model_coef
    Out[36]: 
                  0                  1
    0  Sepal.Length  [-0.402473917528]
    1   Sepal.Width   [-1.46382924771]
    2  Petal.Length    [2.23785647964]
    3   Petal.Width     [1.0000929404]
    In[31]: model_intercept
    Out[31]: array([-0.25906453])
    #scores...
    In[34]: logistic.predict_proba(x)
    Out[34]: 
    array([[ 0.9837306 ,  0.0162694 ],
           [ 0.96407227,  0.03592773],
           [ 0.97647105,  0.02352895],
           [ 0.95654126,  0.04345874],
           [ 0.98534488,  0.01465512],
           [ 0.98086592,  0.01913408],

R代码。

> str(irisdf)
'data.frame':   100 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  0 0 0 0 0 0 0 0 0 0 ...

 > model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.681e-05  -2.110e-08   0.000e+00   2.110e-08   2.006e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)       6.556 601950.324       0        1
Sepal.Length     -9.879 194223.245       0        1
Sepal.Width      -7.418  92924.451       0        1
Petal.Length     19.054 144515.981       0        1
Petal.Width      25.033 216058.936       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 1.3166e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

由于收敛问题，我增加了最大迭代次数并将epsilon设为0.05。

> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf, 
    control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-0.0102793  -0.0005659  -0.0000052   0.0001438   0.0112531  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.796    704.352   0.003    0.998
Sepal.Length   -3.426    215.912  -0.016    0.987
Sepal.Width    -4.208    123.513  -0.034    0.973
Petal.Length    7.615    159.478   0.048    0.962
Petal.Width    11.835    285.938   0.041    0.967

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 5.3910e-04  on 95  degrees of freedom
AIC: 10.001

Number of Fisher Scoring iterations: 12

#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
           1            2            3            4            5 
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08

R和python中的得分，截距和系数都完全不同。哪一个是正确的，我想继续python.Now有混淆，结果是准确的。

Answer 1

更新问题是沿着花瓣宽度变量存在完美的分离。 换句话说，此变量可用于完美地预测给定数据集中的样本是setosa还是versicolor。 这打破了R中逻辑回归中使用的对数似然最大化估计。问题是通过将花瓣宽度系数设为无穷大，可以将对数似然驱动得非常高。

这里讨论一些背景和策略。

CrossValidated讨论策略也有一个很好的主题。

那么为什么sklearn LogisticRegression有效呢？ 因为它采用“正则化逻辑回归”。 正则化惩罚估计参数的大值。

在下面的示例中，我使用Firth的偏差减少逻辑回归方法logistf来生成融合模型。

library(logistf)

iris = read.table("path_to _iris.txt", sep="\t", header=TRUE)
iris$Species <- as.factor(iris$Species)
sapply(iris, class)

model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.

model2 <- logistf(Species ~ ., data = irisdf, family = binomial)
# Does converge.

ORIGINAL基于R解决方案中的std.error和z值，我认为你的模型规范不好。 接近0的z值基本上告诉您模型和因变量之间没有相关性。 所以这是一个荒谬的模型。

我的第一个想法是你需要将Species字段转换为分类变量。 它是您示例中的int类型。 尝试使用as.factor

如何将整数转换为R中的分类数据？

Logistic回归结果在Scikit python和R中有所不同？

问题描述

R代码。

1 个解决方案

解决方案1
3 已采纳 2016-06-17 05:28:05

Logistic回归结果在Scikit python和R中有所不同？

问题描述

R代码。

1 个解决方案

解决方案1 3 已采纳 2016-06-17 05:28:05

解决方案1
3 已采纳 2016-06-17 05:28:05