Python Dataframe: Calculating Confidence or Prediction Intervals Using Groupby on One Column
I have a table like the one below:
Type Actual Predicted
A 4 3
A 10 18
A 13 11
B 3 10
B 4 2
B 8 33
C 20 17
C 40 33
C 87 80
C 32 30
I want to calculate the R^2 and RMSE for each Type. The code to do that is below:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error
def r2_rmse( g ):
    r2 = r2_score( g['Actual'], g['Predicted'] )
    rmse = np.sqrt( mean_squared_error( g['Actual'], g['Predicted'] ) )
    return pd.Series( dict( r2 = r2, rmse = rmse ) )
your_df.groupby( 'Type' ).apply( r2_rmse ).reset_index()
Sample output table (values are hypothetical):
Type R^2 RMSE
A 0.66 4
B 1.00 6
C 0.03 1
The above code works and gives me the output I want. But now I want to add confidence / prediction intervals to the table at the Type level. I have scoured the internet for how to do this, with no luck.
Conceptual question: If I want the range of values in which the Actual value is captured with 95% confidence, do I compute the confidence interval on the Actual column or the Predicted column?
Below is the sample table I want:
Type Conf_Int_90% Conf_Int_80%
A (21, 100) (5, 55)
B (10, 46) (3, 14)
C (1, 19) (12, 19)
I have a sense that the confidence interval code is something like this:
st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a)) BUT ...
What specific code do I incorporate into my existing code (shown above) so that I get the table output I want?
Try the following. From my understanding, the confidence interval should be computed on the Predicted column. Hope it helps you :)
import numpy as np
import pandas as pd
import scipy.stats as st
from sklearn.metrics import r2_score, mean_squared_error
def r2_rmse_interval(g):
    r2 = r2_score( g['Actual'], g['Predicted'] )
    rmse = np.sqrt( mean_squared_error( g['Actual'], g['Predicted'] ))
    st_interval = st.t.interval(0.95, len(g) -1, loc=np.mean(g.Predicted), scale=st.sem(g.Predicted))
    return pd.Series( dict( r2 = r2, rmse = rmse, st_interval = st_interval) )
df = pd.DataFrame({'Type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Actual': [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
                   'Predicted': [3, 18, 11, 10, 2, 33, 17, 33, 80, 30]},
                  columns=['Type', 'Actual', 'Predicted'])
df.groupby( 'Type' ).apply( r2_rmse_interval ).reset_index()
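The question asks for a table with one interval column per confidence level (90% and 80%). A minimal sketch of how that could be built is below. It assumes, as this answer does, that the interval is a t interval for the mean of the Predicted column; the helper names `t_interval` and `interval_table` are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Sample data copied from the question.
df = pd.DataFrame({
    'Type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Actual': [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
    'Predicted': [3, 18, 11, 10, 2, 33, 17, 33, 80, 30],
})

def t_interval(values, level):
    """Two-sided t interval for the mean of `values` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)            # standard error of the mean
    half_width = st.t.ppf(0.5 + level / 2, n - 1) * sem
    return (mean - half_width, mean + half_width)

def interval_table(frame, column='Predicted', levels=(0.90, 0.80)):
    """One row per Type, one (lo, hi) tuple column per confidence level."""
    records = []
    for name, group in frame.groupby('Type'):
        row = {'Type': name}
        for level in levels:
            label = f'Conf_Int_{int(round(level * 100))}%'
            row[label] = t_interval(group[column], level)
        records.append(row)
    return pd.DataFrame(records)

ci_table = interval_table(df)
print(ci_table)
```

Passing `column='Actual'` gives the same kind of interval for the Actual column instead. With only three or four rows per Type, these intervals will be wide.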
Using the standard (normal-approximation) formula for a 95% CI:
sample mean +/- 1.96 * std.err
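The 1.96 multiplier is the two-sided 95% quantile of the normal distribution, which is a good approximation only for reasonably large samples. The groups in the question have three or four rows, where the Student-t multiplier is noticeably larger. The short check below (a sketch, using only `scipy.stats`) compares the two.

```python
import scipy.stats as st

# Two-sided 95% critical values: normal quantile vs. Student-t quantiles.
z_crit = st.norm.ppf(0.975)
t_crits = {n: st.t.ppf(0.975, n - 1) for n in (3, 4, 30)}

print(f"z = {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:2d}: t = {t_crit:.3f}")
```

For n = 3 the t multiplier is about 4.30, so an interval built with 1.96 would be less than half as wide as the t-based one.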
You can do everything in one go with apply:
def stats(g):
    r2 = r2_score(g.Actual, g.Predicted)
    rmse = np.sqrt(mean_squared_error(g.Actual, g.Predicted))
    ci95_hi = g.Predicted.mean() + g.Predicted.sem() * 1.96
    ci95_lo = g.Predicted.mean() - g.Predicted.sem() * 1.96
    return r2, rmse,(ci95_lo, ci95_hi)
df.groupby("Type").apply(stats)
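Both answers give a confidence interval for the mean of the Predicted column. If the goal is instead a range expected to contain the Actual value for a given prediction, one possible approach is a prediction interval built from the residuals (Actual minus Predicted) within each Type. The sketch below is not from either answer. It assumes the residuals within a Type are independent and roughly normal with constant spread, and the function and column names (`residual_prediction_interval`, `resid_lo`, `actual_lo`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Sample data copied from the question.
df = pd.DataFrame({
    'Type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Actual': [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
    'Predicted': [3, 18, 11, 10, 2, 33, 17, 33, 80, 30],
})

def residual_prediction_interval(g, level=0.95):
    """Prediction interval for one new residual, assuming normal residuals."""
    resid = (g['Actual'] - g['Predicted']).to_numpy(dtype=float)
    n = len(resid)
    center = resid.mean()
    spread = resid.std(ddof=1)
    t_crit = st.t.ppf(0.5 + level / 2, n - 1)
    # sqrt(1 + 1/n) covers both the new observation and the estimated mean.
    half_width = t_crit * spread * np.sqrt(1 + 1 / n)
    return pd.Series({'resid_lo': center - half_width,
                      'resid_hi': center + half_width})

# One row of residual bounds per Type.
resid_bounds = (df.groupby('Type')[['Actual', 'Predicted']]
                  .apply(residual_prediction_interval))

# Shift each prediction by its Type's residual bounds.
with_bounds = df.join(resid_bounds, on='Type')
with_bounds['actual_lo'] = with_bounds['Predicted'] + with_bounds['resid_lo']
with_bounds['actual_hi'] = with_bounds['Predicted'] + with_bounds['resid_hi']
print(with_bounds[['Type', 'Predicted', 'actual_lo', 'actual_hi']])
```

With so few rows per Type the resulting ranges are very wide, which reflects how little the sample says about the error of a new prediction.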