[英]Count occurrences within a specific range
I have a data frame that looks like this:我有一个看起来像这样的数据框:
Tag
0 skip_1
1 run
2 skip_1
3 run
4 skip_1
5 run
6 skip_2
7 run
8 skip_1
9 run
10 skip_2
11 jump
12 skip_1
13 run
14 skip_2
15 jump
16 skip_1
17 run
18 skip_2
19 cleanup_jump
20 skip_1
21 run
22 skip_2
23 run
24 skip_2
25 jump
26 skip_1
27 run
28 skip_2
29 jump
First, I would like to count the RUN occurrences between two JUMP events, then to enumerate this occurrences from the latest to the earliest within this range.首先,我想计算两个 JUMP 事件之间的 RUN 发生次数,然后在此范围内从最近到最早枚举此事件。 The expected results would be:
预期结果将是:
Tag Jump_Run_Count Run_Order
0 skip_1 0 0
1 run 0 5
2 skip_1 0 0
3 run 0 4
4 skip_1 0 0
5 run 0 3
6 skip_2 0 0
7 run 0 2
8 skip_1 0 0
9 run 0 1
10 skip_2 0 0
11 jump 5 0
12 skip_1 0 0
13 run 0 1
14 skip_2 0 0
15 jump 1 0
16 skip_1 0 0
17 run 0 0
18 skip_2 0 0
19 cleanup_jump 0 0
20 skip_1 0 0
21 run 0 2
22 skip_2 0 0
23 run 0 1
24 skip_2 0 0
25 jump 2 0
26 skip_1 0 0
27 run 0 1
28 skip_2 0 0
29 jump 1 0
One of the problems here is that the first RUN occurrences are not within 2 JUMP but are between the first JUMP and the beginning of the column.这里的一个问题是第一个 RUN 出现不在 2 JUMP 内,而是在第一个 JUMP 和列的开头之间。
Secondly I would like to do the same count and enumerate for a CLEANUP_JUMP and JUMP range, and store it in separate columns.其次,我想对 CLEANUP_JUMP 和 JUMP 范围进行相同的计数和枚举,并将其存储在单独的列中。
Tag Jump_Run_Count Run_Order Cleanup_Jump_Dig_Count Run_Order2
0 skip_1 0 0 0 0
1 run 0 5 0 0
2 skip_1 0 0 0 0
3 run 0 4 0 0
4 skip_1 0 0 0 0
5 run 0 3 0 0
6 skip_2 0 0 0 0
7 run 0 2 0 0
8 skip_1 0 0 0 0
9 run 0 1 0 0
10 skip_2 0 0 0 0
11 jump 5 0 0 0
12 skip_1 0 0 0 0
13 run 0 1 0 0
14 skip_2 0 0 0 0
15 jump 1 0 0 0
16 skip_1 0 0 0 0
17 run 0 0 0 1
18 skip_2 0 0 0 0
19 cleanup_jump 0 0 1 0
20 skip_1 0 0 0 0
21 run 0 2 0 0
22 skip_2 0 0 0 0
23 run 0 1 0 0
24 skip_2 0 0 0 0
25 jump 2 0 0 0
26 skip_1 0 0 0 0
27 run 0 1 0 0
28 skip_2 0 0 0 0
29 jump 1 0 0 0
I have added some pictures that might explain it better:我添加了一些可能更好地解释它的图片:
Any help on how to code this, or even another way to approach this issue will be highly appreciated.任何有关如何对此进行编码的帮助,或者甚至是解决此问题的其他方法的任何帮助都将受到高度赞赏。
Thanks!谢谢!
Here is a solution using pandas:这是使用 pandas 的解决方案:
import pandas as pd
import numpy as np
df['run'] = df['Tag'] == 'run'
val_mask = df['Tag'].replace({'cleanup_jump':'jump'}) == 'jump'
df['tag_id'] = (val_mask).cumsum()
df.loc[val_mask, 'Jump_Count'] = df.groupby('tag_id')['run'].sum().to_numpy()[:-1]
df.loc[df['run'], 'run_per_jump'] = df.loc[df['run']].groupby('tag_id')['run'].cumsum()
df['Jump_Run_Order'] = df.groupby('tag_id')['run_per_jump'].rank(method='dense', ascending=False)
jumps_idx = np.flatnonzero(df['Tag'] == 'jump')
cj_idxs = np.flatnonzero(df['Tag'] == 'cleanup_jump')
cj_help_idxs = np.asarray([np.max(jumps_idx[jumps_idx < cj_idx]) for cj_idx in cj_idxs])
for start, end in zip(cj_help_idxs+1, cj_idxs):
df.loc[start:end, 'Cleanup_Jump_Count'] = df.loc[start:end, 'Jump_Count']
df.loc[start:end, 'Cleanup_Jump_Run_Order'] = df.loc[start:end, 'Jump_Run_Order']
df.loc[start:end, 'Jump_Run_Order'] = 0
df.loc[start:end, 'Jump_Count'] = 0
df = df.drop(columns=['tag_id', 'run', 'run_per_jump']).fillna(0).convert_dtypes(convert_integer=True)
print(df)
Tag Jump_Count Jump_Run_Order Cleanup_Jump_Run_Order Cleanup_Jump_Count
0 skip_1 0 0 0 0
1 run 0 5 0 0
2 skip_1 0 0 0 0
3 run 0 4 0 0
4 skip_1 0 0 0 0
5 run 0 3 0 0
6 skip_2 0 0 0 0
7 run 0 2 0 0
8 skip_1 0 0 0 0
9 run 0 1 0 0
10 skip_2 0 0 0 0
11 jump 5 0 0 0
12 skip_1 0 0 0 0
13 run 0 1 0 0
14 skip_2 0 0 0 0
15 jump 1 0 0 0
16 skip_1 0 0 0 0
17 run 0 0 1 0
18 skip_2 0 0 0 0
19 cleanup_jump 0 0 0 1
20 skip_1 0 0 0 0
21 run 0 2 0 0
22 skip_2 0 0 0 0
23 run 0 1 0 0
24 skip_2 0 0 0 0
25 jump 2 0 0 0
26 skip_1 0 0 0 0
27 run 0 1 0 0
28 skip_2 0 0 0 0
29 jump 1 0 0 0
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.