Appending predicted values and residuals to pandas dataframe
It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year and a half ago, but it wasn't really answered.
The dataframe currently looks something like this:
y x1 x2
880.37 3.17 23
716.20 4.76 26
974.79 4.17 73
322.80 8.70 72
1054.25 11.45 16
All I want is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:
y x1 x2 y_hat res
880.37 3.17 23 840.27 40.10
716.20 4.76 26 752.60 -36.40
974.79 4.17 73 877.49 97.30
322.80 8.70 72 348.50 -25.70
1054.25 11.45 16 815.15 239.10
I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!
Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.
fittedvalues and resid are pandas Series with the correct index. In the statsmodels version used here, predict does not return a pandas Series.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
'x2': [23, 26, 73, 72, 16],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
index=np.arange(10, 20, 2))
result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid
result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid
# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)
print(df)
x1 x2 y yhat resid yhat2 resid2 \
10 3.17 23 880.37 923.949309 -43.579309 923.949309 -43.579309
12 4.76 26 716.20 890.732201 -174.532201 890.732201 -174.532201
14 4.17 73 974.79 656.155079 318.634921 656.155079 318.634921
16 8.70 72 322.80 610.510952 -287.710952 610.510952 -287.710952
18 11.45 16 1054.25 867.062458 187.187542 867.062458 187.187542
predicted
10 923.949309
12 890.732201
14 656.155079
16 610.510952
18 867.062458
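If the index matters, one defensive option is to wrap the output of predict in a Series yourself. The following is a minimal sketch on the same data, not a required step: np.asarray is used on the assumption that predict may return either a plain array or a Series depending on the statsmodels version, and the name pred is only illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                  index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()

# Strip whatever container predict returns, then attach the dataframe's
# own index explicitly so the assignment aligns by label.
pred = pd.Series(np.asarray(result.predict(df)), index=df.index, name='predicted')
df['predicted'] = pred
print(df)
```

For in-sample predictions this gives the same values as result.fittedvalues; the wrapping only becomes useful when predicting on a different dataframe whose index you want to keep.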
As a preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:
>>> print(result.get_prediction().summary_frame())
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \
10 923.949309 268.931939 -233.171432 2081.070051 -991.466820
12 890.732201 211.945165 -21.194241 1802.658643 -887.328646
14 656.155079 269.136102 -501.844105 1814.154263 -1259.791854
16 610.510952 282.182030 -603.620329 1824.642233 -1339.874985
18 867.062458 329.017262 -548.584564 2282.709481 -1214.750941
obs_ci_upper
10 2839.365439
12 2668.793048
14 2572.102012
16 2560.896890
18 2948.875858
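To get some of these columns onto the original dataframe, the summary frame can be joined back by index. This is a sketch under the assumption that your statsmodels release provides get_prediction and summary_frame with the column names shown above; since the API was described as unsettled, check it against your installed version. The names frame and out are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                  index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()

# In-sample predictions with standard errors and 95% intervals.
frame = result.get_prediction().summary_frame(alpha=0.05)

# Set the index explicitly so the join below aligns on the original labels.
frame.index = df.index

# Keep only the fitted mean and its confidence bounds.
out = df.join(frame[['mean', 'mean_ci_lower', 'mean_ci_upper']])
print(out)
```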
This should be self-explanatory.
import pandas as pd
df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
'x2': [23, 26, 73, 72, 16],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})
# pd.ols was removed in pandas 0.20, so this line needs an older pandas
model = pd.ols(y=df.y, x=df.loc[:, ['x1', 'x2']])
df['y_hat'] = model.y_fitted
df['res'] = model.resid
>>> df
x1 x2 y y_hat res
0 3.17 23 880.37 923.949309 -43.579309
1 4.76 26 716.20 890.732201 -174.532201
2 4.17 73 974.79 656.155079 318.634921
3 8.70 72 322.80 610.510952 -287.710952
4 11.45 16 1054.25 867.062458 187.187542
It's polite to phrase your question so that contributors can easily run your code.
import pandas as pd
y_col = [880.37, 716.20, 974.79, 322.80, 1054.25]
x1_col = [3.17, 4.76, 4.17, 8.70, 11.45]
x2_col = [23, 26, 73, 72, 16]
df = pd.DataFrame()
df['y'] = y_col
df['x1'] = x1_col
df['x2'] = x2_col
Then calling df.head() yields:
y x1 x2
0 880.37 3.17 23
1 716.20 4.76 26
2 974.79 4.17 73
3 322.80 8.70 72
4 1054.25 11.45 16
Now for your question, it's fairly straightforward to add columns with calculated values, though my results don't agree with your sample data:
df['y_hat'] = df['x1'] + df['x2']
df['res'] = df['y'] - df['y_hat']
For me, these yield:
y x1 x2 y_hat res
0 880.37 3.17 23 26.17 854.20
1 716.20 4.76 26 30.76 685.44
2 974.79 4.17 73 77.17 897.62
3 322.80 8.70 72 80.70 242.10
4 1054.25 11.45 16 27.45 1026.80
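The y = x1 + x2 in the question can also be read as regression shorthand (an intercept plus one coefficient per regressor) rather than a literal sum. A minimal sketch of that reading, using np.linalg.lstsq in place of a regression library; the names X and beta are illustrative and not from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [880.37, 716.20, 974.79, 322.80, 1054.25],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16]})

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones(len(df)), df['x1'], df['x2']])

# Ordinary least squares: find beta minimizing ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)

df['y_hat'] = X @ beta              # fitted values
df['res'] = df['y'] - df['y_hat']   # residuals
print(df)
```

With an intercept in the model the residuals sum to zero up to floating-point error, which makes for a quick sanity check.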
Hope this helps!