How do I group a pandas DataFrame using one column or another?
Dear pandas DataFrame experts,
I have been using pandas DataFrames to help with rewriting the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).
I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.
An example DataFrame might contain the following data:
study_name  request_name  total_dlp  x_ray_system_name
head        head               50.0  All systems
head        head              100.0  All systems
head        NaN               200.0  All systems
blank       NaN                75.0  All systems
blank       NaN               125.0  All systems
blank       head              400.0  All systems
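For anyone who wants to reproduce the examples, the table above could be built roughly as follows. This is only a sketch: representing the missing request_name entries with np.nan is an assumption, since the question does not say how they are stored.

```python
import numpy as np
import pandas as pd

# Rebuild the example table; np.nan for the empty request_name cells is an assumption.
df = pd.DataFrame(
    {
        "study_name": ["head", "head", "head", "blank", "blank", "blank"],
        "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
        "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
        "x_ray_system_name": ["All systems"] * 6,
    }
)

print(df)
```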
The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:
df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})
with the following result:
                              total_dlp
                                  count        mean
x_ray_system_name study_name
All systems       blank               3  200.000000
                  head                3  116.666667
I now need to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries, and also the single request_name "head" entry whose study_name is "blank".
I would like the results to look something like this:
                         total_dlp
                             count        mean
x_ray_system_name name
All systems       blank          3  200.000000
                  head           4  187.500000
Does anyone know how I can carry out a groupby based on categories in one field or another?
Any help you can offer will be very much appreciated.
Kind regards,
David
Your (groupby) data is essentially the union of:
- the rows where study_name == request_name, counted once
- the rows where study_name != request_name, counted twice: once for study_name and once for request_name
We can duplicate the data with melt:
(pd.concat([df.query('study_name==request_name')      # equal part
              .drop('request_name', axis=1),          # remove so `melt` doesn't duplicate this data
            df.query('study_name!=request_name')])    # not equal part
   .melt(['x_ray_system_name','total_dlp'])           # melt to duplicate
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)
Update: editing the above code made me realize that we can simplify it to:
# mask `request_name` with `NaN` where it equals `study_name`,
# so those rows are not counted twice after the melt
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
   .melt(['x_ray_system_name','total_dlp'])
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)
Output:
x_ray_system_name  value
All systems        blank    200.0
                   head     187.5
Name: total_dlp, dtype: float64
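The snippets above return only the mean, while the question also asks for the count. A self-contained sketch of the mask-then-melt idea that returns both might look like this. The np.nan placeholders, the intermediate variable names, and the use of value_name="name" to label the index level are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Example data from the question; np.nan for missing request_name is an assumption.
df = pd.DataFrame(
    {
        "study_name": ["head", "head", "head", "blank", "blank", "blank"],
        "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
        "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
        "x_ray_system_name": ["All systems"] * 6,
    }
)

# Blank out request_name where it repeats study_name, so those rows count once.
masked = df.assign(
    request_name=df["request_name"].mask(df["study_name"] == df["request_name"])
)

# Stack study_name and request_name into a single "name" column.
long = masked.melt(id_vars=["x_ray_system_name", "total_dlp"], value_name="name")

# groupby drops NaN keys by default, so the blanked-out names are ignored.
result = long.groupby(["x_ray_system_name", "name"]).agg(
    {"total_dlp": ["count", "mean"]}
)
print(result)
```

With the example data, "head" picks up the 400.0 row whose study_name is "blank", and that row still counts towards "blank" as well.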
I have a similar approach to that of @QuangHoang, but with a different order of operations.
Here I use the original (range) index to choose which duplicated rows to drop.
You can melt, drop_duplicates, dropna and groupby:
(df.reset_index()
   .melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
   .drop_duplicates(['index', 'value'])
   .dropna(subset=['value'])
   .groupby(["x_ray_system_name", 'value'])
   .agg({"total_dlp": ["count", "mean"]})
)
Output:
                        total_dlp
                            count   mean
x_ray_system_name value
All systems       blank         3  200.0
                  head          4  187.5
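If more than two name columns ever need to be combined, the same melt / drop_duplicates / dropna / groupby chain could be wrapped in a small function. The sketch below is only one possible arrangement: the function name summarise_by_any_name, its parameters, and the helper column row_id are hypothetical and do not come from the answers above.

```python
import numpy as np
import pandas as pd


def summarise_by_any_name(df, name_cols, value_col, by):
    """Hypothetical helper: count and mean of value_col, where each row is
    counted once for every distinct name it carries in name_cols."""
    long = (
        df.reset_index(drop=True)
        .rename_axis("row_id")   # remember which original row each entry came from
        .reset_index()
        .melt(
            id_vars=["row_id", value_col, *by],
            value_vars=name_cols,
            value_name="name",
        )
        .dropna(subset=["name"])                # ignore missing names
        .drop_duplicates(["row_id", "name"])    # same name twice in one row counts once
    )
    return long.groupby([*by, "name"])[value_col].agg(["count", "mean"])


# Example data from the question; np.nan for missing request_name is an assumption.
df = pd.DataFrame(
    {
        "study_name": ["head", "head", "head", "blank", "blank", "blank"],
        "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
        "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
        "x_ray_system_name": ["All systems"] * 6,
    }
)

summary = summarise_by_any_name(
    df,
    name_cols=["study_name", "request_name"],
    value_col="total_dlp",
    by=["x_ray_system_name"],
)
print(summary)
```

Keying the de-duplication on the original row position rather than on the data values means that two genuinely separate rows with identical names and doses are both kept.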