Why can't statsmodels reproduce my R logistic regression results?

I'm confused about why my logistic regression models in R and statsmodels do not agree.

If I prepare some data in R with

# From https://courses.edx.org/c4x/MITx/15.071x/asset/census.csv
library(caTools) # for sample.split
census = read.csv("census.csv")
set.seed(2000)
split = sample.split(census$over50k, SplitRatio = 0.6)
censusTrain = subset(census, split==TRUE)
censusTest = subset(census, split==FALSE)

and then run a logistic regression with

CensusLog1 = glm(over50k ~., data=censusTrain, family=binomial)

I see results like

                                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              -8.658e+00  1.379e+00  -6.279 3.41e-10 ***
age                                       2.548e-02  2.139e-03  11.916  < 2e-16 ***
workclass Federal-gov                     1.105e+00  2.014e-01   5.489 4.03e-08 ***
workclass Local-gov                       3.675e-01  1.821e-01   2.018 0.043641 *  
workclass Never-worked                   -1.283e+01  8.453e+02  -0.015 0.987885    
workclass Private                         6.012e-01  1.626e-01   3.698 0.000218 ***
workclass Self-emp-inc                    7.575e-01  1.950e-01   3.884 0.000103 ***
workclass Self-emp-not-inc                1.855e-01  1.774e-01   1.046 0.295646    
workclass State-gov                       4.012e-01  1.961e-01   2.046 0.040728 *  
workclass Without-pay                    -1.395e+01  6.597e+02  -0.021 0.983134   
...

But if I use the same data in Python, by first exporting from R with

write.csv(censusTrain,file="traincensus.csv")
write.csv(censusTest,file="testcensus.csv")

and then importing into Python with

import pandas as pd

census = pd.read_csv("census.csv")
census_train = pd.read_csv("traincensus.csv")
census_test = pd.read_csv("testcensus.csv")

I get errors and strange results that bear no relationship to the ones I get in R.

If I simply try (with f being the formula string defined further down)

import statsmodels.api as sm

census_log_1 = sm.Logit.from_formula(f, census_train).fit()

I get an error:

ValueError: operands could not be broadcast together with shapes (19187,2) (19187,) 

Even if I prepare the data with patsy using

import patsy
f = 'over50k ~ ' + ' + '.join(list(census.columns)[:-1])
y, X = patsy.dmatrices(f, census_train, return_type='dataframe')

trying

census_log_1 = sm.Logit(y, X).fit()

results in the same error. The only way I can avoid errors is to use GLM

census_log_1 = sm.GLM(y, X, family=sm.families.Binomial()).fit()

but this produces results that are entirely different from those produced by (what I thought was) the equivalent R API:

                                                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------
Intercept                                       10.6766      5.985      1.784      0.074        -1.055    22.408
age                                             -0.0255      0.002    -11.916      0.000        -0.030    -0.021
workclass[T. Federal-gov]                       -0.9775      4.498     -0.217      0.828        -9.794     7.839
workclass[T. Local-gov]                         -0.2395      4.498     -0.053      0.958        -9.055     8.576
workclass[T. Never-worked]                       8.8346    114.394      0.077      0.938      -215.374   233.043
workclass[T. Private]                           -0.4732      4.497     -0.105      0.916        -9.288     8.341
workclass[T. Self-emp-inc]                      -0.6296      4.498     -0.140      0.889        -9.446     8.187
workclass[T. Self-emp-not-inc]                  -0.0576      4.498     -0.013      0.990        -8.873     8.758
workclass[T. State-gov]                         -0.2733      4.498     -0.061      0.952        -9.090     8.544
workclass[T. Without-pay]                       10.0745     85.048      0.118      0.906      -156.616   176.765
...

Why is logistic regression in Python producing errors and results different from those produced by R? Are these APIs not in fact equivalent (I've had them work before to produce identical results)? Is there some additional processing of the datasets required to make them usable by statsmodels?

The error is due to the fact that patsy expands the LHS variable into a full Treatment contrast, so the response becomes a two-column matrix instead of a single 0/1 column. Logit does not handle this, as indicated in its docstring, but as you see, GLM with the binomial family does.
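The shape mismatch in the traceback can be reproduced with plain NumPy. This is only a minimal sketch: the hand-built indicator matrix stands in for what patsy produces for a string-valued response, and the array names (y_expanded, y_binary, fitted) are illustrative, not statsmodels internals.

```python
import numpy as np

# Toy response with the same two labels (note the leading spaces).
labels = np.array([" <=50K", " >50K", " <=50K", " >50K", " >50K"])
levels = np.unique(labels)  # sorted: ' <=50K' comes before ' >50K'

# One indicator column per level, similar in shape to a full contrast.
y_expanded = (labels[:, None] == levels).astype(float)  # shape (5, 2)

# A single 0/1 column, which is what a binary logit model expects.
y_binary = (labels == " >50K").astype(float)  # shape (5,)

# Stand-in for a vector of fitted probabilities, one per observation.
fitted = np.full(5, 0.5)

try:
    y_expanded * fitted  # (5, 2) against (5,) cannot be broadcast
    broadcast_error = None
except ValueError as exc:
    broadcast_error = str(exc)

print(y_expanded.shape, y_binary.shape)
print(broadcast_error)
```

With the single 0/1 column, the same elementwise product works, which is why recoding the response avoids the error.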

I can't speak to the difference in the results without the full output. In all likelihood it's different default handling of categorical variables, or you're using different variables. Not all are listed in your output.
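One difference is visible even in the partial output: age is 2.548e-02 with z = 11.916 in R and -0.0255 with t = -11.916 in the GLM table. That pattern would be consistent with the GLM fit treating the first column of the two-column response (' <=50K') as the success class, so that it models the opposite outcome. This is an inference from the printed numbers, not something checked against the data. The sketch below uses synthetic data and a hand-written Newton solver (fit_logit_newton is a hypothetical helper, not a statsmodels function) to show that swapping the 0/1 coding of the response negates every coefficient.

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson, starting from zeros."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        grad = X.T @ (y - p)                       # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic, non-separable data: intercept plus one numeric predictor.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x)))
y = (rng.uniform(size=500) < p_true).astype(float)

beta_pos = fit_logit_newton(X, y)        # models P(y = 1)
beta_neg = fit_logit_newton(X, 1.0 - y)  # models P(y = 0)

print(beta_pos)
print(beta_neg)
```

The two printed vectors match in magnitude and differ in sign, and the standard errors would be identical, so the z statistics flip sign as well.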

You can use logit by doing the following pre-processing step.

census = census.replace(to_replace={'over50k' : {' <=50K' : 0, ' >50K' : 1}})

Note also that the default solver for logit doesn't seem to work all that well for this problem. It runs into a singular matrix problem. Indeed, the condition number for this problem is huge, and what you get in R might not be a fully converged model. You might try reducing your number of dummy variables.

[~/]
[73]: np.linalg.cond(mod.exog)
[73]: 4.5139498536894682e+17
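As a rough illustration of how a redundant dummy column drives the condition number up, here is a small NumPy sketch on made-up data. It does not diagnose which columns are responsible in the census design matrix; the three-level factor and the variable names are assumptions for the example.

```python
import numpy as np

n = 1000
group = np.arange(n) % 3  # a made-up three-level factor

# One indicator column per level.
dummies = (group[:, None] == np.arange(3)).astype(float)
intercept = np.ones((n, 1))

# Intercept plus all-but-one dummy: full column rank.
X_reduced = np.hstack([intercept, dummies[:, 1:]])

# Intercept plus every dummy: the dummies sum to the intercept column.
X_redundant = np.hstack([intercept, dummies])

cond_reduced = np.linalg.cond(X_reduced)
cond_redundant = np.linalg.cond(X_redundant)
print(cond_reduced, cond_redundant)
```

The first value stays small, while the second is many orders of magnitude larger because one column is an exact linear combination of the others.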

I had to use the following to get a solution

mod = sm.formula.logit(f, data=census)
res = mod.fit(method='bfgs', maxiter=1000)    

Some of your cells end up being very small. This is compounded by the other sparse dummy variables.

[~/]
[81]: pd.Categorical(census.occupation).describe()
[81]: 
                    counts     freqs
levels                              
?                    1816  0.056789
Adm-clerical         3721  0.116361
Armed-Forces            9  0.000281
Craft-repair         4030  0.126024
Exec-managerial      3992  0.124836
Farming-fishing       989  0.030928
Handlers-cleaners    1350  0.042217
Machine-op-inspct    1966  0.061480
Other-service        3212  0.100444
Priv-house-serv       143  0.004472
Prof-specialty       4038  0.126274
Protective-serv       644  0.020139
Sales                3584  0.112077
Tech-support          912  0.028520
Transport-moving     1572  0.049159
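One way to reduce the number of sparse dummy variables is to merge rare levels before fitting. The sketch below is only an illustration: lump_rare_levels is a hypothetical helper, the "Other" label and the threshold of 200 are arbitrary choices, and the counts are four of the rows from the table above.

```python
from collections import Counter

def lump_rare_levels(values, min_count, other_label="Other"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Rebuild a column of labels from a few of the counts listed above.
counts_from_table = {
    "Armed-Forces": 9,
    "Priv-house-serv": 143,
    "Protective-serv": 644,
    "Tech-support": 912,
}
occupation = [level for level, n in counts_from_table.items() for _ in range(n)]

# Arbitrary threshold: levels with fewer than 200 rows are merged.
lumped = lump_rare_levels(occupation, min_count=200)
lumped_counts = Counter(lumped)
print(lumped_counts)
```

Here the two smallest levels are merged into a single category of 152 rows, so the model needs one dummy column for them instead of two.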
