Linear Regression based on Groupby

I have a df like this:

Allotment   Year    NDVI     A_Annex    Bachelor
A_Annex     1984    1.0      0.40       0.60
A_Annex     1984    1.5      0.56       0.89
A_Annex     1984    2.0      0.78       0.76
A_Annex     1985    3.4      0.89       0.54
A_Annex     1985    1.6      0.98       0.66
A_Annex     1986    2.5      1.10       0.44
A_Annex     1986    1.7      0.87       0.65
Bachelor    1984    8.9      0.40       0.60
Bachelor    1984    6.5      0.56       0.89
Bachelor    1984    4.2      0.78       0.76
Bachelor    1985    2.4      0.89       0.54
Bachelor    1985    1.7      0.98       0.66
Bachelor    1986    8.9      1.10       0.44
Bachelor    1986    9.6      0.87       0.65

and I want to run a regression based on a groupby. For each unique Allotment, I want to regress its NDVI values against the column that shares its name. So for the rows where Allotment is A_Annex, I want to regress the A_Annex column against NDVI, and then do the same thing for Bachelor using the Bachelor column. Essentially, I want to match each column to its associated Allotment and then regress the values in that column against the corresponding NDVI values.

I could do this for one Allotment like this:

stat=merge.groupby(['Allotment']).apply(lambda x: sp.stats.linregress(x['A_Annex'], x['NDVI']))

but I would need to keep changing the column passed as the first argument in sp.stats.linregress(x['A_Annex'], x['NDVI']) for each Allotment, and I would like to avoid that.

Are you after something like this?

r = {annex: pd.ols(x=group['A_Annex'], y=group['NDVI']) 
     for annex, group in df.groupby('Allotment')}
>>> r

{'A_Annex': 
 -------------------------Summary of Regression Analysis-------------------------

 Formula: Y ~ <x> + <intercept>

 Number of Observations:         7
 Number of Degrees of Freedom:   2

 R-squared:         0.3774
 Adj R-squared:     0.2529

 Rmse:              0.6785

 F-stat (1, 5):     3.0307, p-value:     0.1422

 Degrees of Freedom: model 1, resid 5

 -----------------------Summary of Estimated Coefficients------------------------
       Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
 --------------------------------------------------------------------------------
              x     1.9871     1.1415       1.74     0.1422    -0.2501     4.2244
      intercept     0.3731     0.9454       0.39     0.7094    -1.4798     2.2260
 ---------------------------------End of Summary---------------------------------,
 'Bachelor': 
 -------------------------Summary of Regression Analysis-------------------------

 Formula: Y ~ <x> + <intercept>

 Number of Observations:         7
 Number of Degrees of Freedom:   2

 R-squared:         0.0650
 Adj R-squared:    -0.1220

 Rmse:              3.4787

 F-stat (1, 5):     0.3478, p-value:     0.5810

 Degrees of Freedom: model 1, resid 5

 -----------------------Summary of Estimated Coefficients------------------------
       Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
 --------------------------------------------------------------------------------
              x    -3.4511     5.8522      -0.59     0.5810   -14.9213     8.0191
      intercept     8.7796     4.8467       1.81     0.1298    -0.7200    18.2792
 ---------------------------------End of Summary---------------------------------}
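Note that pd.ols was removed from pandas in later releases, and that the snippet above uses the A_Annex column as x for both groups. A minimal sketch of the same idea with scipy.stats.linregress, picking the x column by the group name, might look like the following. It assumes every Allotment value has a column with exactly the same name; the DataFrame is rebuilt from the sample table in the question, and the name results is only illustrative.

```python
import pandas as pd
from scipy import stats

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Allotment": ["A_Annex"] * 7 + ["Bachelor"] * 7,
    "Year": [1984, 1984, 1984, 1985, 1985, 1986, 1986] * 2,
    "NDVI": [1.0, 1.5, 2.0, 3.4, 1.6, 2.5, 1.7,
             8.9, 6.5, 4.2, 2.4, 1.7, 8.9, 9.6],
    "A_Annex": [0.40, 0.56, 0.78, 0.89, 0.98, 1.10, 0.87] * 2,
    "Bachelor": [0.60, 0.89, 0.76, 0.54, 0.66, 0.44, 0.65] * 2,
})

# For each group, regress NDVI on the column whose name equals the group key
# (assumption: such a column exists for every Allotment).
results = {
    name: stats.linregress(group[name], group["NDVI"])
    for name, group in df.groupby("Allotment")
}

for name, res in results.items():
    print(name, res.slope, res.intercept, res.rvalue ** 2)
```

For the A_Annex group this should reproduce the slope and intercept shown in the summary above; the Bachelor group will differ, because here x is the Bachelor column rather than A_Annex.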

You can then extract the model parameters as follows:

>>> {k: r[k].sm_ols.params for k in r}
{'A_Annex': array([ 1.9871432 ,  0.37310585]),
 'Bachelor': array([-3.45111992,  8.77960702])}
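If a table of parameters is more convenient than a dict of arrays, one possible approach is to collect the fitted values into a DataFrame with one row per Allotment. This is a sketch using scipy.stats.linregress rather than the pd.ols objects above; the helper fit_group and the column names slope, intercept, r_squared and p_value are made up for illustration, and it again assumes each Allotment has a column of the same name.

```python
import pandas as pd
from scipy import stats

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Allotment": ["A_Annex"] * 7 + ["Bachelor"] * 7,
    "Year": [1984, 1984, 1984, 1985, 1985, 1986, 1986] * 2,
    "NDVI": [1.0, 1.5, 2.0, 3.4, 1.6, 2.5, 1.7,
             8.9, 6.5, 4.2, 2.4, 1.7, 8.9, 9.6],
    "A_Annex": [0.40, 0.56, 0.78, 0.89, 0.98, 1.10, 0.87] * 2,
    "Bachelor": [0.60, 0.89, 0.76, 0.54, 0.66, 0.44, 0.65] * 2,
})


def fit_group(name, group):
    """Regress NDVI on the column named after the group (hypothetical helper)."""
    res = stats.linregress(group[name], group["NDVI"])
    return {
        "slope": res.slope,
        "intercept": res.intercept,
        "r_squared": res.rvalue ** 2,
        "p_value": res.pvalue,
    }


# One row per Allotment, one column per fitted quantity.
params = pd.DataFrame.from_dict(
    {name: fit_group(name, group) for name, group in df.groupby("Allotment")},
    orient="index",
)
print(params)
```

The resulting frame can be indexed directly, for example params.loc['A_Annex', 'slope'].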
