Counting cumulative occurrences of values based on date window in Pandas

I have a DataFrame (df) that looks like the following:

+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A  |
| 01-03-17 | B  |
| 01-03-17 | C  |
| 01-05-17 | B  |
| 01-05-17 | D  |
| 01-07-17 | A  |
| 01-07-17 | D  |
| 01-08-17 | C  |
| 01-09-17 | B  |
| 01-09-17 | B  |
+----------+----+

This is the end result I would like to compute:

+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A  |         1 |
| 01-03-17 | B  |         1 |
| 01-03-17 | C  |         1 |
| 01-05-17 | B  |         2 |
| 01-05-17 | D  |         1 |
| 01-07-17 | A  |         2 |
| 01-07-17 | D  |         2 |
| 01-08-17 | C  |         1 |
| 01-09-17 | B  |         2 |
| 01-09-17 | B  |         3 |
+----------+----+-----------+

Logic

I want to calculate the cumulative occurrences of values in id, but only within a specified time window, for example 4 months. That is, an occurrence stops counting once it is more than 4 months old, so after a gap of more than 4 months the counter is back at one.

To get the cumulative occurrences without any time window, we can use df.groupby('id').cumcount() + 1
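As a point of comparison, here is a minimal sketch of what the plain cumcount gives on the sample data. The column name `date`, the ISO date strings, and the `plain_count` column are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-03-01", "2017-03-01", "2017-03-01", "2017-05-01", "2017-05-01",
        "2017-07-01", "2017-07-01", "2017-08-01", "2017-09-01", "2017-09-01",
    ]),
    "id": list("ABCBDADCBB"),
})

# Running count per id over the whole history, with no time window applied.
df["plain_count"] = df.groupby("id").cumcount() + 1
print(df["plain_count"].tolist())
```

The last three values come out as 2, 3, 4, whereas the desired result has 1, 2, 3 there, because the March occurrences of C and B should have dropped out of the window by then.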

Focusing on id = B, we see that the 2nd occurrence of B comes 2 months after the first, so cum_count = 2. The next occurrence of B is on 01-09-17; looking back 4 months we find only one other occurrence, so cum_count = 2, etc.
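The rule described above can be spelled out as a brute-force loop, which is slow but handy for checking faster solutions. This is only a sketch: `windowed_cum_count` is a hypothetical helper, the frame is assumed to be sorted by date, row order is assumed to break same-day ties, and the left endpoint of the window is treated as included (which is what makes A on 01-07-17 count 2).

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-03-01", "2017-03-01", "2017-03-01", "2017-05-01", "2017-05-01",
        "2017-07-01", "2017-07-01", "2017-08-01", "2017-09-01", "2017-09-01",
    ]),
    "id": list("ABCBDADCBB"),
})

def windowed_cum_count(frame, months=4):
    # Hypothetical reference helper: assumes `frame` is sorted by date.
    counts = []
    for pos in range(len(frame)):
        row = frame.iloc[pos]
        seen = frame.iloc[: pos + 1]  # this row and every row above it
        start = row["date"] - pd.DateOffset(months=months)
        # Same id, and no older than `months` calendar months (inclusive).
        in_window = (seen["id"] == row["id"]) & (seen["date"] >= start)
        counts.append(int(in_window.sum()))
    return counts

df["cum_count"] = windowed_cum_count(df)
print(df)
```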

My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.

# test data

    date    id  cum_count_desired
2017-03-01  A   1
2017-03-01  B   1
2017-03-01  C   1
2017-05-01  B   2
2017-05-01  D   1
2017-07-01  A   2
2017-07-01  D   2
2017-08-01  C   1
2017-09-01  B   2
2017-09-01  B   3

# preprocessing

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
# A sorted DatetimeIndex lets us slice each group by date range
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]

# solution

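def cumcounter(x):
    # x: the id_code values of one ID, indexed by date
    # For each row, count the rows in the closed window [d - 4 months, d]
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    # Correct the overcount for same-day duplicates (see "The need for adjust")
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    # adjust is a Series, so this adds element-wise and y becomes a Series
    y += adjust
    return y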

df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)

# output

df[['id', 'id_code', 'cum_count_desired', 'cum_count']]

           id  id_code  cum_count_desired  cum_count
date
2017-03-01  A        0                  1          1
2017-03-01  B        1                  1          1
2017-03-01  C        2                  1          1
2017-05-01  B        1                  2          2
2017-05-01  D        3                  1          1
2017-07-01  A        0                  2          2
2017-07-01  D        3                  2          2
2017-08-01  C        2                  1          1
2017-09-01  B        1                  2          2
2017-09-01  B        1                  3          3

The need for adjust

If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day rows, because the date-based slice grabs all of the same-day values at once when the list comprehension reaches a date on which the ID shows up more than once. Fix:

  1. Group the current DataFrame by date.
  2. Rank each row in each date group.
  3. Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of non-positive integers that ascend to 0 within each date group.
  4. Add these non-positive integer adjustments to y.

This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
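To make the adjustment concrete, here is a small sketch on the rows of id B alone. The Series `b` and the names `raw` and `fixed` are made up for this illustration, and `transform('size')` is used instead of `size()` so that the group sizes line up row by row.

```python
import pandas as pd

# The four rows of id B from the test data, indexed by date.
b = pd.Series(
    [1, 1, 1, 1],
    index=pd.DatetimeIndex(
        ["2017-03-01", "2017-05-01", "2017-09-01", "2017-09-01"], name="date"
    ),
    name="id_code",
)

# Raw window counts: the two 2017-09-01 rows see each other, so both get 3.
raw = [int(b.loc[d - pd.DateOffset(months=4):d].count()) for d in b.index]

gr = b.groupby(level="date")
# Rank minus group size: 0 for lone rows, -1 then 0 for the same-day pair.
adjust = gr.rank(method="first") - gr.transform("size")

fixed = (pd.Series(raw, index=b.index) + adjust).astype(int)
print(raw, adjust.tolist(), fixed.tolist())
```

Only the first of the two same-day rows is lowered, from 3 to 2, which matches the desired cum_count for B.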

Including or excluding the left endpoint of the time interval

To count rows that are exactly 4 calendar months old or newer, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:

y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]

To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:

y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
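The difference between the two variants shows up on id A, whose two rows are exactly 4 calendar months apart. A minimal sketch, with the Series `a` and the names `inclusive` and `exclusive` invented for the illustration:

```python
import pandas as pd

# The two rows of id A: 2017-07-01 is exactly 4 calendar months after 2017-03-01.
a = pd.Series(
    [0, 0],
    index=pd.DatetimeIndex(["2017-03-01", "2017-07-01"], name="date"),
)
d = a.index[-1]

# Window starts on 2017-03-01, so the first row is counted.
inclusive = a.loc[d - pd.DateOffset(months=4):d].count()
# Window starts on 2017-03-02, so the first row is left out.
exclusive = a.loc[d - pd.DateOffset(months=4, days=-1):d].count()
print(inclusive, exclusive)
```

With the test data above, only the inclusive variant reproduces the desired cum_count of 2 for A on 2017-07-01.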

You can extend the groupby with a Grouper:

df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()

Out[48]: 
        date id  cum_count
0 2017-03-01  A          0
1 2017-03-01  B          0
2 2017-03-01  C          0
3 2017-05-01  B          0
4 2017-05-01  D          0
5 2017-07-01  A          0
6 2017-07-01  D          1
7 2017-08-01  C          0
8 2017-09-01  B          0
9 2017-09-01  B          1

We can also use .apply row-wise to work on a sliced df. The slice is built with relativedelta from dateutil.

import pandas as pd
from dateutil.relativedelta import relativedelta

def get_cum_sum(window, row):
    # Count the rows in the window that share the current row's id
    if window.shape[0] == 0:
        return 1
    return window[window['id'] == row['id']].shape[0]

d={'dd_mm_yy':['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],'id':['A','B','C','B','D','A','D','C','B','B']}
df=pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')

# Window for each row: rows at or above it that are at most 4 months older
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))], current_row), axis=1)

>>> df
    dd_mm_yy id  cum_sum
0 2017-03-01  A        1
1 2017-03-01  B        1
2 2017-03-01  C        1
3 2017-05-01  B        2
4 2017-05-01  D        1
5 2017-07-01  A        2
6 2017-07-01  D        2
7 2017-08-01  C        1
8 2017-09-01  B        2
9 2017-09-01  B        3

I also considered whether .rolling could be used, but a month is not a fixed-length period, so it might not work.
