Pandas: count events per day from join date

I have this data frame:

name    event     join_date    created_at    
A       X         2020-12-01   2020-12-01
A       X         2020-12-01   2020-12-01
A       X         2020-12-01   2020-12-02
A       Y         2020-12-01   2020-12-02
B       X         2020-12-05   2020-12-05
B       X         2020-12-05   2020-12-07
C       X         2020-12-07   2020-12-08
C       X         2020-12-07   2020-12-09
...

I want to transform it into this data frame:

name   event    join_date    day_0   day_1    day_2 .... day_n
A      X        2020-12-01   2       1        0          0
A      Y        2020-12-01   0       1        0          0
B      X        2020-12-05   1       0        1          0
C      X        2020-12-07   0       1        1          0
...

The first row means that user A performed event X twice on day_0 (the day they joined), once on day_1, and so on up to day_n.

For now, the result looks like this:

name   event    join_date    day_0   day_1    day_2 .... day_n
A      X        2020-12-01   2       1        0          0
A      Y        2020-12-01   0       1        0          0
B      X        2020-12-05   1       0        1          0
C      X        2020-12-07   1       1        0          0
...

The code treats 2020-12-02 as day_0 instead of day_1, because user A has no Y event on 2020-12-01.
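The asker's code is not shown, so the following is only a guess at the cause: a minimal sketch that assumes the day offset was measured from the first created_at in each (name, event) group rather than from join_date. The column names from_first_event and from_join are hypothetical and exist only for this illustration.

```python
import pandas as pd

# User A's rows from the sample data
df = pd.DataFrame({
    "name": ["A", "A", "A", "A"],
    "event": ["X", "X", "X", "Y"],
    "join_date": pd.to_datetime(["2020-12-01"] * 4),
    "created_at": pd.to_datetime(
        ["2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02"]
    ),
})

# Assumed faulty variant: offset from the earliest event in each group
first_seen = df.groupby(["name", "event"])["created_at"].transform("min")
df["from_first_event"] = (df["created_at"] - first_seen).dt.days

# Offset from the join date, which is what day_n is supposed to mean
df["from_join"] = (df["created_at"] - df["join_date"]).dt.days

print(df[["name", "event", "from_first_event", "from_join"]])
```

Under that assumption, the single Y event lands on offset 0 in the first column and on offset 1 in the second, which matches the discrepancy described above.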

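First subtract join_date from created_at, so that every row carries the time elapsed since that user joined. Measuring from join_date rather than from the first created_at in each group keeps a group's first event on the correct day even when nothing happened on the join date itself.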

Then use DataFrame.pivot_table to count the rows per day offset, add every missing offset with DataFrame.reindex over a timedelta_range, and finally rename the columns using range:

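import pandas as pd

# Both date columns must be datetime64 so the subtraction yields timedeltas
df[['join_date', 'created_at']] = df[['join_date', 'created_at']].apply(pd.to_datetime)

# Time elapsed between joining and each event
df['d'] = df['created_at'].sub(df['join_date'])
print(df)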
  name event  join_date created_at      d
0    A     X 2020-12-01 2020-12-01 0 days
1    A     X 2020-12-01 2020-12-01 0 days
2    A     X 2020-12-01 2020-12-02 1 days
3    A     Y 2020-12-01 2020-12-02 1 days
4    B     X 2020-12-05 2020-12-05 0 days
5    B     X 2020-12-05 2020-12-07 2 days
6    C     X 2020-12-07 2020-12-08 1 days
7    C     X 2020-12-07 2020-12-09 2 days

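# Count rows per (name, event, join_date) for each day offset
df1 = (df.pivot_table(index=['name', 'event', 'join_date'],
                      columns='d',
                      aggfunc='size',
                      fill_value=0)
         # Add offsets that never occur; start at 0 days so day_0 is always the join date
         .reindex(pd.timedelta_range(pd.Timedelta(0), df['d'].max()),
                  axis=1,
                  fill_value=0))
df1.columns = [f'day_{i}' for i in range(len(df1.columns))]
df1 = df1.reset_index()
print(df1)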
  name event  join_date  day_0  day_1  day_2
0    A     X 2020-12-01      2      1      0
1    A     Y 2020-12-01      0      1      0
2    B     X 2020-12-05      1      0      1
3    C     X 2020-12-07      0      1      1
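
For a self-contained run, the sample rows from the question can be built by hand. The sketch below follows the same steps but is a variation rather than the code above: it assumes integer day offsets (via .dt.days) and groupby/size/unstack in place of pivot_table, and the names day and counts are chosen only for this example.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "event": ["X", "X", "X", "Y", "X", "X", "X", "X"],
    "join_date": pd.to_datetime(
        ["2020-12-01"] * 4 + ["2020-12-05"] * 2 + ["2020-12-07"] * 2
    ),
    "created_at": pd.to_datetime([
        "2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02",
        "2020-12-05", "2020-12-07", "2020-12-08", "2020-12-09",
    ]),
})

# Whole days between joining and each event
df["day"] = (df["created_at"] - df["join_date"]).dt.days

counts = (
    df.groupby(["name", "event", "join_date", "day"])
      .size()                                   # events per group and day
      .unstack("day", fill_value=0)             # one column per observed day
      .reindex(columns=range(df["day"].max() + 1), fill_value=0)  # fill gaps from day 0
      .add_prefix("day_")
      .reset_index()
)
print(counts)
```

Because the columns are reindexed over range(0, max + 1), a day on which no user has any event still gets an all-zero column, and day_0 always refers to the join date.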
