
How do I group a pandas DataFrame using one column or another

Dear pandas DataFrame experts,

I have been using pandas DataFrames to help with re-writing the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).

I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.

An example dataframe might contain the following data:

study_name   request_name   total_dlp   x_ray_system_name
      head           head        50.0         All systems
      head           head       100.0         All systems
      head            NaN       200.0         All systems
     blank            NaN        75.0         All systems
     blank            NaN       125.0         All systems
     blank           head       400.0         All systems
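
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is a minimal sketch: the column dtypes and the use of np.nan for the missing request_name entries are assumptions, since only the printed frame is shown.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the printed table.
# Assumption: missing request_name entries are np.nan.
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

print(df)
```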

The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:

df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})

with the following result:

                                 total_dlp
                                     count         mean
x_ray_system_name   study_name   
All systems         blank                3   200.000000
                    head                 3   116.666667

I now need to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries, and also the single request_name "head" entry whose study_name is not "head".

I would like the results to look something like this:

                                 total_dlp
                                     count         mean
x_ray_system_name   name   
All systems         blank                3   200.000000
                    head                 4   187.500000

Does anyone know how I can carry out a groupby based on categories in one field or another?

Any help you can offer will be very much appreciated.

Kind regards,

David

Your (groupby) data is essentially the union of:

  1. the rows with study_name == request_name, taken once
  2. the rows with study_name != request_name, duplicated: one copy for study_name, one for request_name

We can duplicate the data with melt:

(pd.concat([df.query('study_name==request_name')    # equal part
              .drop('request_name', axis=1),        # remove so `melt` doesn't duplicate this data
            df.query('study_name!=request_name')])  # not equal part
   .melt(['x_ray_system_name','total_dlp'])         # melt to duplicate
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Update: editing the above code made me realize that we can simplify it to:

# mask `request_name` with `NaN` where it equals `study_name`,
# so those entries are ignored when duplicating and taking the mean
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
   .melt(['x_ray_system_name','total_dlp']) 
   .groupby(['x_ray_system_name','value'])
   ['total_dlp'].mean()
)

Output:

x_ray_system_name  value
All systems        blank    200.0
                   head     187.5
Name: total_dlp, dtype: float64
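
If both the count and the mean are wanted, as in the layout the question asks for, the masked melt can be combined with the question's own agg call. The following is only a sketch: the frame is rebuilt from the question's example table (with np.nan assumed for the missing entries), value_name="name" is chosen here just to match the requested index label, and masked, long_df and result are illustrative names.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the question (np.nan for missing names is assumed).
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Blank out request_name wherever it repeats study_name, so a row is
# never counted twice under the same name.
masked = df.assign(
    request_name=df["request_name"].mask(df["study_name"] == df["request_name"])
)

# One row per (original row, name column); NaN names are dropped by groupby.
long_df = masked.melt(id_vars=["x_ray_system_name", "total_dlp"], value_name="name")

result = long_df.groupby(["x_ray_system_name", "name"]).agg(
    {"total_dlp": ["count", "mean"]}
)
print(result)
```

Note that the row with study_name "blank" and request_name "head" contributes to both groups, which is what the requested output implies.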

I have a similar approach to that of @QuangHoang but with a different order of operations.

Here I use the original (range) index to choose which duplicate data to drop.

You can melt, drop_duplicates, dropna and groupby:

(df.reset_index()
   .melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
   .drop_duplicates(['index', 'value'])
   .dropna(subset=['value'])
   .groupby(["x_ray_system_name", 'value'])
   .agg({"total_dlp": ["count", "mean"]})
)

Output:

                        total_dlp       
                            count   mean
x_ray_system_name value                 
All systems       blank         3  200.0
                  head          4  187.5
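
To see what the drop_duplicates step is doing, here is a small sketch that compares the aggregation with and without it. The frame is rebuilt from the question's example (np.nan assumed for the missing entries), and long_df, naive and deduped are illustrative names.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the question (np.nan for missing names is assumed).
df = pd.DataFrame({
    "study_name": ["head", "head", "head", "blank", "blank", "blank"],
    "request_name": ["head", "head", np.nan, np.nan, np.nan, "head"],
    "total_dlp": [50.0, 100.0, 200.0, 75.0, 125.0, 400.0],
    "x_ray_system_name": ["All systems"] * 6,
})

# Keep the original row label as a column so it survives the melt.
long_df = df.reset_index().melt(id_vars=["index", "total_dlp", "x_ray_system_name"])

keys = ["x_ray_system_name", "value"]

# Without de-duplication, rows where both names are "head" are counted twice.
naive = long_df.groupby(keys)["total_dlp"].agg(["count", "mean"])

# Dropping repeated (original row, name) pairs counts each row once per name.
deduped = (long_df.drop_duplicates(["index", "value"])
                  .groupby(keys)["total_dlp"]
                  .agg(["count", "mean"]))

print(naive)
print(deduped)
```

On this data the naive version reports a count of 6 for "head", because the two rows with "head" in both columns appear twice after the melt; the de-duplicated version gives the intended count of 4.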
