How to perform a distinct average in a Pandas groupby in Python?

I have data like this:

Input

>>> import pandas as pd
>>> df.head(8)
date          id       count
01.02.2020    a        5
01.02.2020    b        10
02.02.2020    a        6
02.02.2020    b        11
03.02.2020    a        9
03.02.2020    a        13
03.02.2020    b        3
03.02.2020    b        5
...

Desired output

date          distinctAverage
01.02.2020    7.5
02.02.2020    8.5
03.02.2020    15         # (9+13+3+5)/2, because the 4 rows contain only 2 distinct IDs
...

Function

I want to compute the average of "count" per distinct ID within a groupby expression, that is, the sum of "count" divided by the number of unique IDs in each group. I group the data like this:

df.groupby(
    ["date"]
    ).agg(
        #sumCount=("count", "sum"), # works!
        #countUniqueIDs=("id", lambda x: x.nunique()),  # works!
        distinctAverage=("count", lambda x, y=df["id"]: x.sum() / y.nunique()), # Doesn't work!
        distinctAverage2=("count", "mean") # Doesn't work, takes 4 as the denominator at 03.02.2020
    ).reset_index()

Any idea how to compute a distinct average?

EDIT: The distinctAverage shown above works fine for the sample data. On a larger dataset that can't be shown here it doesn't work, for whatever reason. There is a workaround: in the groupby, aggregate "sumCount" and "countUniqueIDs", then add another line after the groupby (with df now holding the aggregated result): df["workaroundDistinctAverage"] = df["sumCount"] / df["countUniqueIDs"]. Not very elegant, but easier to understand than the accepted answer.
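
The workaround from the edit can be sketched end to end as follows. This is a minimal sketch on the sample rows from the question; the variable name out is an assumption (the edit reuses df for the aggregated result), and the "nunique" string is used in place of the lambda from the question.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "date": ["01.02.2020", "01.02.2020", "02.02.2020", "02.02.2020",
             "03.02.2020", "03.02.2020", "03.02.2020", "03.02.2020"],
    "id": ["a", "b", "a", "b", "a", "a", "b", "b"],
    "count": [5, 10, 6, 11, 9, 13, 3, 5],
})

# Step 1: per date, aggregate the sum of "count" and the number of distinct IDs
# ("out" is a name chosen for this sketch)
out = df.groupby("date").agg(
    sumCount=("count", "sum"),
    countUniqueIDs=("id", "nunique"),
).reset_index()

# Step 2: divide the two aggregated columns
out["workaroundDistinctAverage"] = out["sumCount"] / out["countUniqueIDs"]
print(out)
```

Both denominators here are computed inside each date group, so the result does not depend on which IDs appear on other dates.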

Save the result of .groupby() in a variable, then compute what you need with .sum() and .nunique():

grouper = df.groupby(['date'])

(
  (grouper['count'].sum() / grouper['id'].nunique())
  .reset_index(name = 'distinctAverage')
)
#output:
    date        distinctAverage
0   01.02.2020  7.5
1   02.02.2020  8.5
2   03.02.2020  15.0

This works just fine!

df.groupby(["date"]).agg(
        distinctAverage=("count", lambda x, y=df["id"]: float(x.sum()/ y.nunique()))
        )
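
One caveat, offered as a likely explanation rather than a confirmed diagnosis of the larger dataset mentioned in the question: the default argument y=df["id"] is the whole "id" column, so y.nunique() counts distinct IDs across all dates instead of within each group. It matches the desired output on the sample only because every date contains the same two IDs. The sketch below adds one hypothetical row (id "c") to show the difference, and restricts the IDs to the rows of the current group through x.index; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Sample rows from the question plus one hypothetical row (id "c" on 03.02.2020)
df = pd.DataFrame({
    "date": ["01.02.2020", "01.02.2020", "02.02.2020", "02.02.2020",
             "03.02.2020", "03.02.2020", "03.02.2020", "03.02.2020",
             "03.02.2020"],
    "id": ["a", "b", "a", "b", "a", "a", "b", "b", "c"],
    "count": [5, 10, 6, 11, 9, 13, 3, 5, 6],
})

grouped = df.groupby("date")["count"]

# Denominator is the number of distinct IDs in the WHOLE frame (3 here),
# regardless of which IDs actually occur on each date
global_denominator = grouped.agg(lambda x, y=df["id"]: x.sum() / y.nunique())

# Denominator is the number of distinct IDs among the current group's rows,
# looked up through the group's index labels
per_group = grouped.agg(lambda x: x.sum() / df.loc[x.index, "id"].nunique())

print(global_denominator)
print(per_group)
```

With this data the first two dates come out as 15/3 and 17/3 in the first version, while the per-group version gives 7.5 and 8.5. The index lookup assumes the index labels of df are unique.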
