Counting cumulative occurrences of values based on date window in Pandas
I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
I want to calculate the cumulative occurrences of values in id, but within a specified time window, for example 4 months, i.e. every 5th month the counter resets to one.
To get the plain cumulative occurrences we can use df.groupby('id').cumcount() + 1, but this ignores the time window.
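As a quick check, here is that plain (un-windowed) cumulative count on the question's data; note that row 7 (C at 01-08-17) becomes 2 rather than the desired 1, which is exactly what the date window needs to fix:

```python
import pandas as pd

# The question's data, dates parsed as day-month-year
df = pd.DataFrame({
    "dd_mm_yy": ["01-03-17", "01-03-17", "01-03-17", "01-05-17", "01-05-17",
                 "01-07-17", "01-07-17", "01-08-17", "01-09-17", "01-09-17"],
    "id": list("ABCBDADCBB"),
})
df["dd_mm_yy"] = pd.to_datetime(df["dd_mm_yy"], format="%d-%m-%y")

# Plain per-id cumulative count, ignoring any date window
df["cum_all"] = df.groupby("id").cumcount() + 1
print(df["cum_all"].tolist())  # [1, 1, 1, 2, 1, 2, 2, 2, 3, 4]
```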
Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    # count occurrences of this id in the 4-month window ending at each date
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    # non-positive adjustments for ids that repeat on the same day (explained below)
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    return (pd.Series(y, index=x.index) + adjust.values).astype(int)
df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
           id  id_code  cum_count_desired  cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
Why adjust is needed
If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day IDs, because the date-based slice immediately grabs all of the same-day values when the list comprehension encounters the date on which multiple IDs show up.
Fix: group by date, and within each date subtract the group size from each row's first-occurrence rank (gr.rank(method='first') - gr.size()). These non-positive adjustments are then added to y. This only affects one row in the given test data -- the second-to-last row, because B appears twice on the same day.
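A minimal sketch of that adjustment, using a hypothetical three-row series with one repeated date: the lone row gets 0, and the same-day pair gets -1 and 0, so only the earlier of the two same-day rows is corrected:

```python
import pandas as pd

# Hypothetical series standing in for one id's rows: one lone date,
# then the same date twice (like B on 01-09-17)
x = pd.Series([1, 1, 1],
              index=pd.to_datetime(["2017-05-01", "2017-09-01", "2017-09-01"]))
x.index.name = "date"

gr = x.groupby("date")
# rank within each date minus the date-group size: 0 for lone rows,
# negative for all but the last row of a same-day group
adjust = gr.rank(method="first") - gr.size()
print(adjust.tolist())  # [0.0, -1.0, 0.0]
```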
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
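A small sketch of the difference between the two offsets, on a hypothetical two-row series whose occurrences are exactly 4 calendar months apart:

```python
import pandas as pd

s = pd.Series([0, 0], index=pd.to_datetime(["2017-03-01", "2017-07-01"]))
d = s.index[-1]

# months=4: the window starts at 2017-03-01, so the left endpoint is counted
inclusive = s.loc[d - pd.DateOffset(months=4):d].count()
# months=4, days=-1: the window starts at 2017-03-02, excluding the endpoint
exclusive = s.loc[d - pd.DateOffset(months=4, days=-1):d].count()
print(inclusive, exclusive)  # 2 1
```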
You can extend the groupby with a pd.Grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
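Two caveats with this approach, which the output above shows: cumcount() is 0-based (add 1 for counts starting at one), and pd.Grouper(freq='4M') creates fixed calendar bins rather than a rolling 4-month lookback, so e.g. B at 01-05-17 lands in a fresh bin and gets a count of 1 instead of the desired 2:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-01", "2017-03-01", "2017-03-01",
                            "2017-05-01", "2017-05-01", "2017-07-01",
                            "2017-07-01", "2017-08-01", "2017-09-01",
                            "2017-09-01"]),
    "id": list("ABCBDADCBB"),
})

# 1-based counts within fixed 4-month (month-end) bins;
# note: recent pandas versions spell this frequency "4ME"
df["cum_count"] = df.groupby(["id", pd.Grouper(freq="4M", key="date")]).cumcount() + 1
print(df["cum_count"].tolist())  # [1, 1, 1, 1, 1, 1, 2, 1, 1, 2]
```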
We can make use of .apply row-wise to work on a sliced df as well. The slice is based on relativedelta from dateutil.
import pandas as pd
from dateutil.relativedelta import relativedelta

def get_cum_sum(window, row):
    # count rows in the sliced window that share this row's id
    if window.shape[0] == 0:
        return 1
    return window[window['id'] == row.id].shape[0]
d = {'dd_mm_yy': ['01-03-17', '01-03-17', '01-03-17', '01-05-17', '01-05-17',
                  '01-07-17', '01-07-17', '01-08-17', '01-09-17', '01-09-17'],
     'id': ['A', 'B', 'C', 'B', 'D', 'A', 'D', 'C', 'B', 'B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')

# for each row, slice df to the rows up to this row within the last 4 months
df['cum_sum'] = df.apply(
    lambda current_row: get_cum_sum(
        df[(df.index <= current_row.name)
           & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))],
        current_row),
    axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I considered whether .rolling is feasible, but months are not a fixed period, so it might not work.
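As a rough sketch only: .rolling cannot take calendar-month offsets, but a fixed day-based window can approximate them. A 123-day window with closed='both' (so the left endpoint is kept) happens to reproduce the desired counts on this particular data, though month lengths vary, so this is not exact in general:

```python
import pandas as pd

df = pd.DataFrame({
    "dd_mm_yy": pd.to_datetime(["01-03-17", "01-03-17", "01-03-17", "01-05-17",
                                "01-05-17", "01-07-17", "01-07-17", "01-08-17",
                                "01-09-17", "01-09-17"], format="%d-%m-%y"),
    "id": list("ABCBDADCBB"),
}).set_index("dd_mm_yy")

# count rows per id within an approximate 4-month (123-day) trailing window
roll = (df.assign(one=1)
          .groupby("id")["one"]
          .rolling("123D", closed="both")
          .sum())
print(roll.loc["B"].tolist())  # [1.0, 2.0, 2.0, 3.0]
```

The day count has to be tuned: 123 days reaches back exactly from 01-09-17 to 01-05-17 here, but a different span of months would need a different number.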