Python 2.7 - statsmodels - formatting and writing summary output

I'm doing logistic regression on Mac OS X Lion, using pandas 0.11.0 for data handling and statsmodels 0.4.3 for the actual regression.

I'm going to be running ~2,900 different logistic regression models and need the results output to a csv file, formatted in a particular way.

Currently, I only know how to call print result.summary(), which prints the results to the shell as follows:

 Logit Regression Results                           
  ==============================================================================
 Dep. Variable:            death_death   No. Observations:                 9752
 Model:                          Logit   Df Residuals:                     9747
 Method:                           MLE   Df Model:                            4
 Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
 Time:                        22:15:05   Log-Likelihood:                -5806.9
 converged:                       True   LL-Null:                       -5655.8
                                         LLR p-value:                     1.000
 ===============================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
 -------------------------------------------------------------------------------
 age_age5064    -0.1999      0.055     -3.619      0.000        -0.308    -0.092
 age_age6574    -0.2553      0.053     -4.847      0.000        -0.359    -0.152
 sex_female     -0.2515      0.044     -5.765      0.000        -0.337    -0.166
 stage_early    -0.1838      0.041     -4.528      0.000        -0.263    -0.104
 access         -0.0102      0.001    -16.381      0.000        -0.011    -0.009
 ===============================================================================

I will also need the odds ratios, which are computed by print np.exp(result.params) and printed in the shell like this:

age_age5064    0.818842
age_age6574    0.774648
sex_female     0.777667
stage_early    0.832098
access         0.989859
dtype: float64

What I need is for each of these to be written to a csv file as one very long row like the following (I am not sure, at this point, whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):

`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`

I think you get the picture: a very long row with all of these actual values, and a header with all the column designations in a similar format.

I am familiar with the csv module in Python, and am becoming more familiar with pandas. I'm not sure whether this info could be formatted and stored in a pandas DataFrame and then written to a file with to_csv once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing each model's results as it completes (using the csv module) is also fine.

UPDATE:

So, I was looking more at the statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think inheriting from this class to create another class, where some of the methods/operators get changed, might be the way to go in order to get the formatting I require. I have very little experience in the ways of doing this, and will need to spend quite a bit of time figuring this out (which is fine). If anybody can help or has more experience, that would be awesome!

Here is the site where the classes are laid out: statsmodels results class

There is currently no premade table of parameters and their result statistics available.

Essentially, you need to stack all the results yourself. Whether you use a list, a numpy array, or a pandas DataFrame depends on what is most convenient for you.

For example, if I want one numpy array that holds the results for a model, llf plus the results in the summary parameter table, then I could use

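import numpy as np

res_all = []
for res in results:
    # conf_int() returns one row per parameter with (lower, upper) columns
    low, upp = np.asarray(res.conf_int()).T   # unpack columns
    res_all.append(np.concatenate(([res.llf], res.params, res.tvalues,
                                   res.pvalues, low, upp)))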

But it might be better to align with pandas, depending on what structure you have across models.
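
As a rough sketch of what aligning with pandas could look like (written for Python 3, not part of the original answer): if each model's results are held as a labelled Series, building a DataFrame from them lines the columns up by name and leaves NaN where a model lacks a term. The model names and the second model's numbers below are made up for illustration.

```python
import io

import pandas as pd

# One labelled row per model. model_1 reuses values from the question;
# model_2 is invented purely to show a model with different covariates.
rows = {
    "model_1": pd.Series({"llf": -5806.9, "age_age5064_coef": -0.1999, "access_coef": -0.0102}),
    "model_2": pd.Series({"llf": -5790.0, "sex_female_coef": -0.2515, "access_coef": -0.0110}),
}

# Columns of the dict-built frame are models; transpose so each model is a row.
all_results = pd.DataFrame(rows).T
all_results.index.name = "model"

# Written to an in-memory buffer here; pass a file path to to_csv in real use.
buf = io.StringIO()
all_results.to_csv(buf)
csv_text = buf.getvalue()
```

Terms missing from a model show up as empty fields in the csv, so all ~2,900 rows can share one header even if the covariates differ.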

You could write a helper function that takes all the results from the results instance and concatenates them in a row.
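
One possible shape for such a helper, as a sketch for Python 3 rather than a tested recipe: it assumes the model was fit on pandas data, so that res.params is a Series carrying the variable names, and it reads the params, bse, tvalues, pvalues, llf and conf_int() members of the results object. The name flatten_result, the column suffixes, and the SimpleNamespace stand-in (used only so the example runs without fitting a model) are illustrative.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def flatten_result(res):
    """Flatten one fitted result into a single labelled row (a pandas Series)."""
    ci = np.asarray(res.conf_int())      # shape (n_params, 2): lower, upper
    params = np.asarray(res.params)
    table = pd.DataFrame(
        {
            "coef": params,
            "std_err": np.asarray(res.bse),
            "z": np.asarray(res.tvalues),
            "p": np.asarray(res.pvalues),
            "ci_low": ci[:, 0],
            "ci_upp": ci[:, 1],
            "odds_ratio": np.exp(params),
        },
        index=list(res.params.index),    # assumes params is a pandas Series
    )
    row = {"llf": res.llf}
    for name, stats in table.iterrows():
        for stat, value in stats.items():
            row["{}_{}".format(name, stat)] = value
    return pd.Series(row)


# Stand-in for a fitted result, filled with numbers from the question's summary.
names = ["age_age5064", "access"]
demo = SimpleNamespace(
    params=pd.Series([-0.1999, -0.0102], index=names),
    bse=pd.Series([0.055, 0.001], index=names),
    tvalues=pd.Series([-3.619, -16.381], index=names),
    pvalues=pd.Series([0.0, 0.0], index=names),
    llf=-5806.9,
    conf_int=lambda: pd.DataFrame([[-0.308, -0.092], [-0.011, -0.009]], index=names),
)
row = flatten_result(demo)
```

This groups the statistics per variable (age_age5064_coef, age_age5064_std_err, ...) instead of putting all odds ratios at the end as in the question; reordering the resulting columns afterwards is straightforward.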

(I'm not sure what is the most convenient way to write to csv by rows.)
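
One option for writing by rows is csv.DictWriter from the standard library, appending a row as each model finishes. The sketch below is for Python 3 and assumes every model produces the same set of columns; the helper name append_row, the column names, the file name, and the second row's numbers are all illustrative.

```python
import csv
import os
import tempfile


def append_row(path, row, fieldnames):
    """Append one result row; write the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


fieldnames = ["model_id", "llf", "access_coef", "access_odds_ratio"]
out_path = os.path.join(tempfile.mkdtemp(), "logit_results.csv")

# First row uses values from the question; the second is invented for the demo.
append_row(out_path, {"model_id": 1, "llf": -5806.9,
                      "access_coef": -0.0102, "access_odds_ratio": 0.989859}, fieldnames)
append_row(out_path, {"model_id": 2, "llf": -5790.0,
                      "access_coef": -0.011, "access_odds_ratio": 0.98906}, fieldnames)

with open(out_path, newline="") as f:
    lines = f.read().splitlines()
```

Appending per model means a crash partway through the ~2,900 fits does not lose the rows already written. DictWriter raises ValueError if a row contains a key that is not in fieldnames, which catches a model with an unexpected term early.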

edit:

Here is an example storing the regression results in a dataframe:

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21

The loop is on line 159.

summary() and similar code outside of statsmodels, for example http://johnbeieler.org/py_apsrtable/ for combining several results, are oriented towards printing, not towards storing variables.

write_path = '/my/path/here/output.csv'
with open(write_path, 'w') as f:
    f.write(result.summary().as_csv())
  • results.params : for coefficients
  • results.pvalues : for p-values

BTW, you can use dir(results) to find out all the attributes of an object.
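
Since dir() also lists private and special names, a small filter can make the output easier to scan. This is only a sketch: public_attributes and DemoResults are hypothetical names, and DemoResults is a stand-in so the example runs without a fitted model.

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class DemoResults(object):
    """Hypothetical stand-in for a fitted statsmodels results object."""
    params = [-0.1999, -0.0102]
    pvalues = [0.0, 0.0]

    def conf_int(self):
        return [(-0.308, -0.092), (-0.011, -0.009)]


names = public_attributes(DemoResults())
```

On a real results object the same call would list members such as params, pvalues and conf_int alongside the rest of the public interface.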

I found this formulation to be a little more straightforward. You can add or remove columns by following the syntax from the examples (pvals, coeff, conf_lower, conf_higher).

import pandas as pd     #This can be left out if already present...

def results_summary_to_dataframe(results):
    '''Take a statsmodels results object and transform its summary table into a DataFrame.'''
    pvals = results.pvalues
    coeff = results.params
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]

    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                                })

    #Reordering...
    results_df = results_df[["coeff","pvals","conf_lower","conf_higher"]]
    return results_df
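
As an example of adding columns in the same style, the sketch below appends odds ratios to a frame laid out like the one returned above. It assumes a logit model, where exponentiating a coefficient and its confidence bounds gives the odds ratio and its interval. The name add_odds_ratios and the new column names are illustrative, and the input frame is typed in by hand from the question's summary so the example runs without a fitted model.

```python
import numpy as np
import pandas as pd


def add_odds_ratios(results_df):
    """Return a copy with odds ratios and their confidence bounds appended."""
    out = results_df.copy()
    out["odds_ratio"] = np.exp(out["coeff"])
    out["or_lower"] = np.exp(out["conf_lower"])
    out["or_higher"] = np.exp(out["conf_higher"])
    return out


# Same layout as the output of results_summary_to_dataframe.
demo_df = pd.DataFrame(
    {
        "coeff": [-0.1999, -0.0102],
        "pvals": [0.0, 0.0],
        "conf_lower": [-0.308, -0.011],
        "conf_higher": [-0.092, -0.009],
    },
    index=["age_age5064", "access"],
)
with_or = add_odds_ratios(demo_df)
```

The input frame is left unchanged because the function works on a copy.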

There is actually a built-in method, documented in the documentation here:

f = open('csvfile.csv','w')
f.write(result.summary().as_csv())
f.close()

I believe this is a much easier (and cleaner) way to output the summaries to csv files.
