How to group a Pandas DataFrame by one column or another

Question

Dear pandas DataFrame experts,

I have been using pandas DataFrames to help with rewriting the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).

I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.

An example DataFrame might contain the following data:

study_name   request_name   total_dlp   x_ray_system_name
      head           head        50.0         All systems
      head           head       100.0         All systems
      head            NaN       200.0         All systems
     blank            NaN        75.0         All systems
     blank            NaN       125.0         All systems
     blank           head       400.0         All systems

The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:

df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})

with the following result:

                                 total_dlp
                                     count         mean
x_ray_system_name   study_name   
All systems         blank                3   200.000000
                    head                 3   116.666667
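For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example data and runs the grouping above. The variable name `baseline` is only illustrative, and the missing request_name values are assumed to be real NaN rather than the string "NaN".

```python
import numpy as np
import pandas as pd

# Rebuild the example table; NaN marks a missing request_name.
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Count and mean of total_dlp per (system, study_name) pair.
baseline = df.groupby(["x_ray_system_name", "study_name"]).agg(
    {"total_dlp": ["count", "mean"]}
)
print(baseline)
```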

I now need to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries, and also the single request_name "head" entry whose study_name is not "head" (a row where both columns are "head" should count only once).

I would like the results to look something like this:

                                 total_dlp
                                     count         mean
x_ray_system_name   name   
All systems         blank                3   200.000000
                    head                 4   187.500000

Does anyone know how I can carry out a groupby based on categories in one field or another?

Any help you can offer will be very much appreciated.

Kind regards,

David

Answer 1

Your (groupby) data is essentially the union of:

rows with study_name == request_name, taken once
rows with study_name != request_name, duplicated: one copy for study_name, one for request_name

We can duplicate the data with melt:

(pd.concat([df.query('study_name==request_name')    # equal part
              .drop('request_name', axis=1),        # remove so `melt` doesn't duplicate this data
            df.query('study_name!=request_name')])  # not equal part
   .melt(['x_ray_system_name','total_dlp'])         # melt to duplicate
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Update: editing the above code made me realize that we can simplify this to:

# mask `request_name` with `NaN` where they equal `study_name`
# so they are ignored when duplicate/mean
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
   .melt(['x_ray_system_name','total_dlp']) 
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Output:

x_ray_system_name  value
All systems        blank    200.0
                   head     187.5
Name: total_dlp, dtype: float64
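The question asked for both count and mean, with the grouping level called "name". As a sketch only, the masked melt can be combined with the `agg` call from the question. The `value_name="name"` argument and the explicit `dropna` are additions for illustration (groupby already skips NaN keys by default), and the example data is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Blank out request_name where it repeats study_name, so each row
# contributes at most once per distinct name.
masked = df.assign(
    request_name=df["request_name"].mask(df["study_name"] == df["request_name"])
)

# Stack both name columns into one column called "name".
long_df = masked.melt(id_vars=["x_ray_system_name", "total_dlp"], value_name="name")

result = (
    long_df.dropna(subset=["name"])
    .groupby(["x_ray_system_name", "name"])
    .agg({"total_dlp": ["count", "mean"]})
)
print(result)
```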

Answer 2

I have a similar approach to that of @QuangHoang, but with the operations in a different order.

Here I use the original (range) index to decide which duplicated rows to drop.

You can melt, drop_duplicates, dropna and groupby:

(df.reset_index()
   .melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
   .drop_duplicates(['index', 'value'])
   .dropna(subset=['value'])
   .groupby(["x_ray_system_name", 'value'])
   .agg({"total_dlp": ["count", "mean"]})
)

Output:

                        total_dlp       
                            count   mean
x_ray_system_name value                 
All systems       blank         3  200.0
                  head          4  187.5
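To see what the drop_duplicates step on the original index is doing, the following sketch compares the melt with and without it on the example data. The names `melted`, `naive` and `deduped` are only illustrative. Without the step, rows where both columns say "head" are counted twice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Keep the original row number so copies of the same row can be recognised.
melted = df.reset_index().melt(id_vars=["index", "total_dlp", "x_ray_system_name"])

# No de-duplication: a row with the same name in both columns counts twice.
naive = (
    melted.dropna(subset=["value"])
    .groupby(["x_ray_system_name", "value"])["total_dlp"]
    .agg(["count", "mean"])
)

# De-duplicate on (original row, name) before aggregating.
deduped = (
    melted.drop_duplicates(["index", "value"])
    .dropna(subset=["value"])
    .groupby(["x_ray_system_name", "value"])["total_dlp"]
    .agg(["count", "mean"])
)

print(naive)
print(deduped)
```

On this data the naive version reports six "head" entries, while the de-duplicated version reports the four that the question asks for.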

2 answers

Answer 1
5 2021-11-03 14:28:37

Answer 2
2 Accepted 2021-11-03 14:34:42
