Appending predicted values and residuals to pandas dataframe

It's a useful and common practice to append the predicted values and residuals from a regression to a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. A very similar question was asked about a year and a half ago, but it wasn't really answered.

The dataframe currently looks something like this:

y               x1           x2   
880.37          3.17         23
716.20          4.76         26
974.79          4.17         73
322.80          8.70         72
1054.25         11.45        16

All I want is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:

y               x1           x2       y_hat         res
880.37          3.17         23       840.27        40.10
716.20          4.76         26       752.60        -36.40
974.79          4.17         73       877.49        97.30
322.80          8.70         72       348.50        -25.70
1054.25         11.45        16       815.15        239.10

I've tried to do this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!

Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula interface or the array/DataFrame interface to the models.

fittedvalues and resid are pandas Series with the correct index. In the statsmodels version used here, predict does not return a pandas Series.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                   index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid


result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid

# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)

print(df)

       x1  x2        y        yhat       resid       yhat2      resid2  \
10   3.17  23   880.37  923.949309  -43.579309  923.949309  -43.579309   
12   4.76  26   716.20  890.732201 -174.532201  890.732201 -174.532201   
14   4.17  73   974.79  656.155079  318.634921  656.155079  318.634921   
16   8.70  72   322.80  610.510952 -287.710952  610.510952 -287.710952   
18  11.45  16  1054.25  867.062458  187.187542  867.062458  187.187542   

     predicted  
10  923.949309  
12  890.732201  
14  656.155079  
16  610.510952  
18  867.062458  
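The remark about `fittedvalues` and `resid` carrying "the correct index" matters because pandas aligns a Series on its labels when it is assigned as a column, whereas a plain array is assigned by position. The following is a minimal sketch of that behaviour using pandas and NumPy only; the column names `by_label`, `by_position` and `mismatched` and the stand-in values are made up for illustration and are not regression output.

```python
import numpy as np
import pandas as pd

# Same non-default index as in the answer above: 10, 12, 14, 16, 18.
df = pd.DataFrame({'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                  index=np.arange(10, 20, 2))

# A Series with the same labels (here in reversed order) is matched by label,
# so label 10 receives 1.0, label 12 receives 2.0, and so on.
labelled = pd.Series([5.0, 4.0, 3.0, 2.0, 1.0], index=[18, 16, 14, 12, 10])
df['by_label'] = labelled

# A plain NumPy array has no labels, so it is assigned by position.
df['by_position'] = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A Series with the default 0..4 index shares no labels with df,
# so every row of the new column is NaN.
df['mismatched'] = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

print(df)
```

This is why assigning an index-aligned Series is the safer route when the dataframe has been filtered or reordered before fitting.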

As a preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:

>>> print(result.get_prediction().summary_frame())
          mean     mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
10  923.949309  268.931939    -233.171432    2081.070051   -991.466820   
12  890.732201  211.945165     -21.194241    1802.658643   -887.328646   
14  656.155079  269.136102    -501.844105    1814.154263  -1259.791854   
16  610.510952  282.182030    -603.620329    1824.642233  -1339.874985   
18  867.062458  329.017262    -548.584564    2282.709481  -1214.750941   

    obs_ci_upper  
10   2839.365439  
12   2668.793048  
14   2572.102012  
16   2560.896890  
18   2948.875858  

This should be self-explanatory.

import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})
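# Note: pd.ols was deprecated in pandas 0.18 and removed in 0.20,
# so this line only runs on older pandas versions.
model = pd.ols(y=df.y, x=df.loc[:, ['x1', 'x2']])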
df['y_hat'] = model.y_fitted
df['res'] = model.resid

>>> df
      x1  x2        y       y_hat         res
0   3.17  23   880.37  923.949309  -43.579309
1   4.76  26   716.20  890.732201 -174.532201
2   4.17  73   974.79  656.155079  318.634921
3   8.70  72   322.80  610.510952 -287.710952
4  11.45  16  1054.25  867.062458  187.187542
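Since `pd.ols` is not available in recent pandas releases, the same two columns can be produced without it. The following is a rough sketch that swaps in `np.linalg.lstsq` for the fit; it assumes an intercept is wanted (as the output above implies), and the names `X` and `beta` are illustrative rather than part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])

# Ordinary least-squares solution of X @ beta ~ y.
beta, *_ = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)

# Fitted values and residuals, appended as new columns.
df['y_hat'] = X @ beta
df['res'] = df['y'] - df['y_hat']

print(df)
```

This gives only the point estimates; for standard errors or confidence intervals a statistics package such as statsmodels is still the more convenient choice.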

It's polite to write your question so that contributors can easily run your code.

import pandas as pd

y_col = [880.37, 716.20, 974.79, 322.80, 1054.25]
x1_col = [3.17, 4.76, 4.17, 8.70, 11.45]
x2_col = [23, 26, 73, 72, 16]

df = pd.DataFrame()
df['y'] = y_col
df['x1'] = x1_col
df['x2'] = x2_col

Then calling df.head() yields:

         y     x1  x2
0   880.37   3.17  23
1   716.20   4.76  26
2   974.79   4.17  73
3   322.80   8.70  72
4  1054.25  11.45  16

Now for your question: it's fairly straightforward to add columns with calculated values, though my results don't agree with your sample data:

df['y_hat'] = df['x1'] + df['x2']
df['res'] = df['y'] - df['y_hat']

For me, these yield:

         y     x1  x2  y_hat      res
0   880.37   3.17  23  26.17   854.20
1   716.20   4.76  26  30.76   685.44
2   974.79   4.17  73  77.17   897.62
3   322.80   8.70  72  80.70   242.10
4  1054.25  11.45  16  27.45  1026.80
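Because the question asks to return a dataframe rather than modify one, the same literal y = x1 + x2 calculation can also be written with `DataFrame.assign`, which leaves the original untouched. This is only a sketch of an alternative spelling, assuming a pandas version in which later keyword arguments to `assign` can refer to columns created by earlier ones; the name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'y': [880.37, 716.20, 974.79, 322.80, 1054.25],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16]})

# assign returns a new dataframe; each lambda receives the frame built so far,
# so res can use the y_hat column created just before it.
result = df.assign(
    y_hat=lambda d: d['x1'] + d['x2'],
    res=lambda d: d['y'] - d['y_hat'],
)

print(result)
```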

Hope this helps!
