
Confidence Interval in Python dataframe

I am trying to calculate the mean and the 95% confidence interval of a column "Force" in a large dataset. I need the result per group, using groupby on the different values of "Class".

When I calculate the mean and put it in a new dataframe, it gives me NaN values for all rows. I'm not sure if I'm going about this the correct way. Is there an easier way to do this?

This is the sample dataframe:

df=pd.DataFrame({ 'Class': ['A1','A1','A1','A2','A3','A3'], 
                  'Force': [50,150,100,120,140,160] },
                   columns=['Class', 'Force'])

To calculate the confidence interval, the first step I took was to calculate the mean. This is what I used:

F1_Mean = df.groupby(['Class'])['Force'].mean()

This gave me NaN values for all rows.

import pandas as pd
import math

df=pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'], 
                 'Force': [50,150,100,120,140,160] },
                 columns=['Class', 'Force'])
print(df)
print('-'*30)

stats = df.groupby(['Class'])['Force'].agg(['mean', 'count', 'std'])
print(stats)
print('-'*30)

ci95_hi = []
ci95_lo = []

for i in stats.index:
    m, c, s = stats.loc[i]
    # 1.96 is the two-sided 95% critical value of the standard normal distribution
    ci95_hi.append(m + 1.96*s/math.sqrt(c))
    ci95_lo.append(m - 1.96*s/math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
print(stats)

The output is

  Class  Force
0    A1     50
1    A1    150
2    A1    100
3    A2    120
4    A3    140
5    A3    160
------------------------------
       mean  count        std
Class                        
A1      100      3  50.000000
A2      120      1        NaN
A3      150      2  14.142136
------------------------------
       mean  count        std     ci95_hi     ci95_lo
Class                                                
A1      100      3  50.000000  156.580326   43.419674
A2      120      1        NaN         NaN         NaN
A3      150      2  14.142136  169.600000  130.400000

You can simplify @yoonghm's solution by taking advantage of 'sem', which is the standard error of the mean.

import pandas as pd

df=pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'], 
                 'Force': [50,150,100,120,140,160] },
                 columns=['Class', 'Force'])
print(df)
print('-'*30)

stats = df.groupby(['Class'])['Force'].agg(['mean', 'sem'])
print(stats)
print('-'*30)


stats['ci95_hi'] = stats['mean'] + 1.96* stats['sem']
stats['ci95_lo'] = stats['mean'] - 1.96* stats['sem']
print(stats)
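
The multiplier used above comes from the normal distribution, which is a reasonable approximation for large groups. For groups as small as the ones in the sample, a Student's t critical value is often preferred. The following is a minimal sketch of that variant, assuming SciPy is installed; the names summary and t_crit are only illustrative and do not come from the answers.

```python
import pandas as pd
from scipy import stats as st  # assumption: SciPy is available

df = pd.DataFrame({'Class': ['A1', 'A1', 'A1', 'A2', 'A3', 'A3'],
                   'Force': [50, 150, 100, 120, 140, 160]})

# Per-group mean, group size, and standard error of the mean.
summary = df.groupby('Class')['Force'].agg(['mean', 'count', 'sem'])

# Two-sided 95% critical value with (count - 1) degrees of freedom per group.
# A group of size 1 has 0 degrees of freedom, so its value is NaN.
t_crit = st.t.ppf(0.975, (summary['count'] - 1).to_numpy())

summary['ci95_lo'] = summary['mean'] - t_crit * summary['sem']
summary['ci95_hi'] = summary['mean'] + t_crit * summary['sem']
print(summary)
```

The t-based intervals are wider than the normal-based ones for small groups and approach them as the group size grows. A single-row group such as A2 still yields NaN bounds, since no spread can be estimated from one observation.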

As mentioned in the comments, I could not reproduce your error, but you can check that your numbers are stored as numbers and not as strings. Use df.info() and make sure that the relevant columns are float or int:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
Class    6 non-null object   # <--- non-number column
Force    6 non-null int64    # <--- number (int) column
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
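
If df.info() shows a text dtype for the column, one possible fix is to convert it before grouping. The sketch below uses a hypothetical frame named raw in which 'Force' was read in as text and one entry cannot be parsed; it is not the asker's data, only an illustration of pd.to_numeric.

```python
import pandas as pd

# Hypothetical data: 'Force' holds strings, including one unparseable entry.
raw = pd.DataFrame({'Class': ['A1', 'A1', 'A2', 'A2'],
                    'Force': ['50', '150', '120', 'n/a']})
print(raw.dtypes)  # 'Force' is reported as a text dtype, not int or float

# errors='coerce' turns values that cannot be parsed into NaN
# instead of raising an exception.
raw['Force'] = pd.to_numeric(raw['Force'], errors='coerce')
print(raw.dtypes)  # 'Force' is now float64

# mean() skips NaN by default, so the bad entry does not affect the result.
means = raw.groupby('Class')['Force'].mean()
print(means)
```

Coercing silently drops bad values from the statistics, so it is worth counting the NaN entries afterwards (for example with raw['Force'].isna().sum()) to see how much data was affected.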
