Python Dataframe: Calculating Confidence or Prediction Intervals Using Groupby on One Column

I have a table like the one below:

Type    Actual  Predicted
A       4       3
A       10      18
A       13      11
B       3       10
B       4       2
B       8       33
C       20      17
C       40      33
C       87      80
C       32      30

I wanted to calculate the R^2 and RMSE for each Type. The code to do that is below:

import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error

def r2_rmse( g ):
    r2 = r2_score( g['Actual'], g['Predicted'] )
    rmse = np.sqrt( mean_squared_error( g['Actual'], g['Predicted'] ) )
    return pd.Series( dict(  r2 = r2, rmse = rmse ) )

your_df.groupby( 'Type' ).apply( r2_rmse ).reset_index()

Sample Output Table (values are hypothetical):

Type    R^2     RMSE    
A       0.66    4   
B       1.00    6   
C       0.03    1

The above code worked and gave me the output I wanted. But now I want to add confidence / prediction intervals to the table at the Type level. I have literally scoured the internet for how to do this, with no luck.

Conceptual Question: If I want the range of values in which the Actual value is captured with 95% confidence, do I run the confidence interval on the Actual column or the Predicted column?

Below is the sample table I want:

Type    Conf_Int_90%    Conf_Int_80%
A       (21, 100)       (5, 55)
B       (10, 46)        (3, 14)
C       (1, 19)         (12, 19)

I have a sense that the confidence interval code is something like this:

st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a)) BUT ... 

What specific code do I incorporate into my existing code (shown above) so I get the table output I want?

Try the following. From my understanding, the confidence interval should be computed on the Predicted column. Hope it helps you :)

import numpy as np
import pandas as pd
import scipy.stats as st
from sklearn.metrics import r2_score, mean_squared_error

def r2_rmse_interval(g):
    # Goodness-of-fit metrics for this group
    r2 = r2_score(g['Actual'], g['Predicted'])
    rmse = np.sqrt(mean_squared_error(g['Actual'], g['Predicted']))
    # 95% t-interval for the mean of Predicted, with len(g) - 1 degrees of freedom
    st_interval = st.t.interval(0.95, len(g) - 1, loc=np.mean(g['Predicted']), scale=st.sem(g['Predicted']))
    return pd.Series(dict(r2=r2, rmse=rmse, st_interval=st_interval))


df = pd.DataFrame({'Type': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
               'Actual': [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
               'Predicted': [3, 18, 11, 10, 2, 33, 17, 33, 80, 30]}, 
                columns=['Type', 'Actual', 'Predicted'])

df.groupby( 'Type' ).apply( r2_rmse_interval ).reset_index()
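
The question asks for 90% and 80% intervals rather than a single 95% one. Here is a minimal sketch of one way to get those columns, reusing the same t-interval on the Predicted column. The helper name mean_intervals and its levels parameter are made up for illustration, and the explicit loop stands in for groupby.apply only to keep the tuple cells easy to see:

```python
import pandas as pd
import scipy.stats as st

def mean_intervals(values, levels=(0.90, 0.80)):
    """t-based confidence intervals for the mean of `values`, one per level."""
    n = len(values)
    mean = values.mean()
    sem = st.sem(values)  # standard error of the mean (ddof=1)
    out = {}
    for level in levels:
        lo, hi = st.t.interval(level, n - 1, loc=mean, scale=sem)
        out[f"Conf_Int_{round(level * 100)}%"] = (round(lo, 2), round(hi, 2))
    return out

df = pd.DataFrame({
    "Type": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "Actual": [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
    "Predicted": [3, 18, 11, 10, 2, 33, 17, 33, 80, 30],
})

# One row per Type, one tuple-valued column per confidence level
rows = []
for type_name, g in df.groupby("Type"):
    row = {"Type": type_name}
    row.update(mean_intervals(g["Predicted"]))
    rows.append(row)

intervals = pd.DataFrame(rows)
print(intervals)
```

With this approach the 80% interval always sits inside the 90% one for the same Type, since both are centred on the same mean.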

Using the standard normal-approximation formula for a 95% CI:

sample mean +/- 1.96 * std.err
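
The 1.96 multiplier is the normal critical value, which suits large samples. With only 3 or 4 rows per Type, the t critical value used in the previous answer is noticeably larger, so the two intervals will differ. A quick check of the multipliers, assuming SciPy is available:

```python
import scipy.stats as st

# Two-sided 95% critical value from the normal distribution
z_crit = st.norm.ppf(0.975)

# Matching t critical values for small and moderate group sizes
t_crits = {n: st.t.ppf(0.975, df=n - 1) for n in (3, 4, 30)}

print(f"z: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n={n}: t={t_crit:.3f}")
```

The gap shrinks as the group size grows, which is why the 1.96 shortcut is usually reserved for larger samples.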

You can do everything in one go with apply:

def stats(g):
    r2 = r2_score(g.Actual, g.Predicted)
    rmse = np.sqrt(mean_squared_error(g.Actual, g.Predicted))
    # Half-width of the 95% interval: 1.96 standard errors of the Predicted mean
    margin = 1.96 * g.Predicted.sem()
    ci95_lo = g.Predicted.mean() - margin
    ci95_hi = g.Predicted.mean() + margin
    return r2, rmse, (ci95_lo, ci95_hi)

df.groupby("Type").apply(stats)
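
Both answers above give an interval for the mean of Predicted. If the goal is a range that should contain the Actual value for a given prediction, one common alternative (not taken from the answers above) is a prediction interval built from the residuals Actual - Predicted. Below is a rough sketch, assuming the residuals within each Type are roughly normal and independent; residual_band is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

def residual_band(g, level=0.95):
    """Return (lo, hi) offsets to add to a new Predicted value of this Type."""
    resid = g["Actual"] - g["Predicted"]
    n = len(resid)
    t_crit = st.t.ppf(0.5 + level / 2, df=n - 1)
    # The sqrt(1 + 1/n) factor accounts for predicting one new observation
    half_width = t_crit * resid.std(ddof=1) * np.sqrt(1 + 1 / n)
    return resid.mean() - half_width, resid.mean() + half_width

df = pd.DataFrame({
    "Type": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "Actual": [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
    "Predicted": [3, 18, 11, 10, 2, 33, 17, 33, 80, 30],
})

bands = {name: residual_band(g) for name, g in df.groupby("Type")}

# Example: range for the Actual value when a Type A prediction is 15
lo, hi = bands["A"]
print((15 + lo, 15 + hi))
```

With so few rows per Type these bands come out very wide, so treat them as a starting point rather than a calibrated interval.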
