pandas groupby on date to give rolling data slices
I have sports data, for example a running group, where each distance value is associated with the date of the run and the runner's name:
import pandas as pd
df=pd.DataFrame({'name': 'Jack Jill Bob Bella Norm Nella Jack Jill Bob Bella Norm Nella Jack Jill Bob Bella Norm Nella'.split(),
'date': '05-04-2021 05-04-2021 05-04-2021 06-04-2021 05-04-2021 06-04-2021 06-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 08-04-2021 11-04-2021 11-04-2021 15-04-2022 15-04-2022 18-04-2022 19-04-2022'.split(),
'km': [5.85, 5.18, 13.58, 14.45, 14.58, 11.14, 8.85, 10.77, 12.54, 7.09, 7.69, 11.64, 9.82, 11.20, 10.33, 11.31, 14.66, 12.56]})
df['date']=pd.to_datetime(df['date'], infer_datetime_format=True)
I would like to group by and filter on date to provide a rolling, expanding slice of data to aggregate on. I can do this using a loop that filters on each unique date, which gives a series of summed km values, with the unique date then added as a separate column. The type of data and format I'm after is provided by this code:
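for d in df.date.unique():
    # Sum km per runner over all runs up to and including date d.
    # Selecting [['km']] keeps the datetime column out of the sum.
    rolling = df[df.date <= d].groupby('name')[['km']].sum()
    rolling['date'] = d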
I would like to accomplish this using .groupby(), as I have much more data and complexity in what I actually want to do. Happy to be pointed to a pre-existing answer that I haven't found after searching...
The expected output is unclear, but assuming you want the cumulative km for each name on each date, you could use:
out = (df
       .groupby(['name', 'date']).sum()
       .groupby(level='name').cumsum()
       .reset_index()
       )
output:
name date km
0 Bella 2021-06-04 14.45
1 Bella 2021-08-04 21.54
2 Bella 2022-04-15 32.85
3 Bob 2021-05-04 13.58
4 Bob 2021-11-04 26.12
5 Bob 2022-04-15 36.45
6 Jack 2021-05-04 5.85
7 Jack 2021-06-04 14.70
8 Jack 2021-11-04 24.52
9 Jill 2021-05-04 5.18
10 Jill 2021-08-04 15.95
11 Jill 2021-11-04 27.15
12 Nella 2021-06-04 11.14
13 Nella 2021-08-04 22.78
14 Nella 2022-04-19 35.34
15 Norm 2021-05-04 14.58
16 Norm 2021-11-04 22.27
17 Norm 2022-04-18 36.93
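To see why two groupby steps are used, here is a minimal sketch on a small toy frame. The frame `toy` and its values are made up for illustration and are not the data from the question. The first step collapses several runs by the same runner on the same date into one row, and the second step accumulates those daily totals within each name.

```python
import pandas as pd

# Hypothetical toy data (not from the question): Jack runs twice on the same day.
toy = pd.DataFrame({
    'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
    'date': pd.to_datetime(['2021-04-05', '2021-04-05', '2021-04-06',
                            '2021-04-05', '2021-04-08']),
    'km': [5.0, 3.0, 2.0, 4.0, 6.0],
})

# Step 1: one row per (name, date), summing same-day runs.
# groupby sorts its keys by default, so dates are ascending within each name.
daily = toy.groupby(['name', 'date'])['km'].sum()

# Step 2: running total of the daily sums within each name.
running = daily.groupby(level='name').cumsum().reset_index()
print(running)
```

Each row of `running` is the total distance a runner has covered up to and including that date, but only for dates on which that runner actually ran.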
The cumulative output in `out` can conveniently be viewed as a 2D table using pivot:
out2 = (df
        .groupby(['name', 'date']).sum()
        .groupby(level='name').cumsum()
        .reset_index()
        .pivot(index='date', columns='name', values='km')
        )
output:
name Bella Bob Jack Jill Nella Norm
date
2021-05-04 NaN 13.58 5.85 5.18 NaN 14.58
2021-06-04 14.45 NaN 14.70 NaN 11.14 NaN
2021-08-04 21.54 NaN NaN 15.95 22.78 NaN
2021-11-04 NaN 26.12 24.52 27.15 NaN 22.27
2022-04-15 32.85 36.45 NaN NaN NaN NaN
2022-04-18 NaN NaN NaN NaN NaN 36.93
2022-04-19 NaN NaN NaN NaN 35.34 NaN
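The NaN cells mark dates on which a runner did not run. If the goal is instead to reproduce the loop from the question, where every date carries each runner's total up to that date, one possible approach is to fill the missing (date, name) pairs with 0 before taking the cumulative sum. This is a sketch under that assumption, shown on a made-up toy frame (`toy` and its values are hypothetical) rather than the question's data.

```python
import pandas as pd

# Hypothetical toy data (not from the question).
toy = pd.DataFrame({
    'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
    'date': pd.to_datetime(['2021-04-05', '2021-04-05', '2021-04-06',
                            '2021-04-05', '2021-04-08']),
    'km': [5.0, 3.0, 2.0, 4.0, 6.0],
})

# Daily km per (date, name) as a date x name table; days without a run count as 0.
wide = toy.groupby(['date', 'name'])['km'].sum().unstack(fill_value=0)

# Cumulative sum down the dates: row d holds each runner's total km with date <= d.
expanding = wide.cumsum()

# Back to long format, one row per (date, name), similar to the loop's output.
long = expanding.reset_index().melt(id_vars='date', var_name='name', value_name='km')
print(expanding)
print(long)
```

One difference from the loop: a runner who has not run yet by a given date appears here with 0 km, whereas the filtered groupby in the loop would leave that runner out of the slice entirely.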