Looking for a way to produce a table of statistics from columns in a data frame

I have a data set with category/code columns, e.g. male/female, state of service, and code of service, plus a column of paid claims.

I am looking for a way to create a table/pivot in Python that shows only the top 10 service codes by average paid claim (i.e. which 10 codes have the highest average paid claims). I also want to append the median, standard deviation, and count for each code. A sample of the data and the output I want are shown below.

Table:

gender, code, state, paid claim
F, 1234, TX, $300
F, 2345, NJ, $120
F, 3456, NJ, $30
M, 1234, MN, $250
M, 4567, CA, $50
F, 1234, MA, $70
F, 8901, CA, $150
F, 23457, NY, $160
F, 4567, SD, $125

Output I am trying to generate (top 10 average paid claims by code):

code, average claim, median claim, count claim
1234,  206, xxx, 3

So, I did something like:

service_code_average=df.groupby('service_code', as_index=False)['paid claim'].mean().sort_values(by='paid claim')

I was not able to limit the result to the top 10, and I was struggling to append the median and count.

Here you can use the agg function, which lets you specify multiple aggregation functions in one go. You can do the following:

# df is the DataFrame from the question
# convert strings such as "$300" to integers
df['paid claim'] = df['paid claim'].str.extract(r'(\d+)', expand=False).astype(int)

# number of largest claims per code to average
top_n = 2  # set this to 10

# apply several aggregations at once using named aggregation (pandas >= 0.25);
# the nested-dict form of agg was removed in pandas 1.0
# note: 'average' is the mean of the top_n largest claims within each code
df1 = df.groupby('code')['paid claim'].agg(
    average=lambda x: x.nlargest(top_n).mean(),
    counts='count',
    median='median',
).reset_index()

print(df1)

    code  average  counts  median
0   1234    275.0       3   250.0
1   2345    120.0       1   120.0
2   3456     30.0       1    30.0
3   4567     87.5       2    87.5
4   8901    150.0       1   150.0
5  23457    160.0       1   160.0
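Note that the table above averages the top_n largest claims within each code and lists every code. If the goal is instead the one stated in the question (the 10 codes with the highest overall average, plus median, standard deviation, and count), a minimal sketch could look like the following. It assumes the sample rows from the question, a "$"-prefixed `paid claim` column, and illustrative names (`raw`, `summary`) that are not part of the original post:

```python
import io

import pandas as pd

# Sample rows copied from the question's table.
raw = """gender,code,state,paid claim
F,1234,TX,$300
F,2345,NJ,$120
F,3456,NJ,$30
M,1234,MN,$250
M,4567,CA,$50
F,1234,MA,$70
F,8901,CA,$150
F,23457,NY,$160
F,4567,SD,$125
"""
df = pd.read_csv(io.StringIO(raw))

# Strip "$" (and any thousands separators) before converting to numbers.
df["paid claim"] = pd.to_numeric(
    df["paid claim"].str.replace(r"[$,]", "", regex=True)
)

top_n = 10  # number of codes to keep

summary = (
    df.groupby("code")["paid claim"]
    .agg(average="mean", median="median", stdev="std", count="count")
    .nlargest(top_n, "average")  # keep the top_n codes by average, highest first
    .reset_index()
)
print(summary)
```

With this sample there are only six distinct codes, so all six are returned, ordered from highest to lowest average; code 1234 comes first with an average of about 206.67, which matches the "206" in the question. Codes with a single claim get NaN for `stdev`, because the sample standard deviation is undefined for one observation.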
