Pandas groupby and shift spilling between groups
I'm having an issue with Pandas where combining groupby and shift seems to spill data between the groups.
Here's a reproducible example:
import pandas as pd
from pandas import Timestamp
sample = {'start': {0: Timestamp('2022-08-02 07:20:00'),
1: Timestamp('2022-08-02 07:25:00'),
2: Timestamp('2022-08-02 07:26:00'),
3: Timestamp('2022-08-02 07:35:00'),
4: Timestamp('2022-08-02 08:20:00'),
5: Timestamp('2022-08-02 08:25:00'),
6: Timestamp('2022-08-02 08:26:00'),
7: Timestamp('2022-08-02 08:35:00')},
'end': {0: Timestamp('2022-08-02 07:30:00'),
1: Timestamp('2022-08-02 07:35:00'),
2: Timestamp('2022-08-02 12:34:00'),
3: Timestamp('2022-08-02 07:40:00'),
4: Timestamp('2022-08-02 08:30:00'),
5: Timestamp('2022-08-02 08:55:00'),
6: Timestamp('2022-08-02 08:34:00'),
7: Timestamp('2022-08-02 08:40:00')},
'group': {0: 'G1',
1: 'G1',
2: 'G1',
3: 'G1',
4: 'G2',
5: 'G2',
6: 'G2',
7: 'G2'}}
df = pd.DataFrame(sample)
df = df.sort_values('start')
df['notworking'] = df.groupby('group')['end'].shift().cummax()
This gives the following output:
start end group notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00 G1
1 2022-08-02 07:25:00 2022-08-02 07:35:00 G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00 G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00 G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00 G2
5 2022-08-02 08:25:00 2022-08-02 08:55:00 G2 2022-08-02 12:34:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00 G2 2022-08-02 12:34:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00 G2 2022-08-02 12:34:00
The 'end' at index 2 is correctly assigned to 'notworking' at index 3, but this value persists into the next group.
My desired outcome is for cummax() to start fresh for each group, like this:
start end group notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00 G1
1 2022-08-02 07:25:00 2022-08-02 07:35:00 G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00 G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00 G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00 G2
5 2022-08-02 08:25:00 2022-08-02 08:55:00 G2 2022-08-02 08:30:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00 G2 2022-08-02 08:55:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00 G2 2022-08-02 08:55:00
I guess this is simple user error. Does anyone know a fix for this?
groupby.shift returns a plain Series, so your cummax runs over that whole Series and not over the SeriesGroupBy you wanted. That is why the running maximum from G1 carries over into G2. You can try groupby.transform, which applies the function to each group separately:
df['notworking'] = df.groupby('group')['end'].transform(lambda col: col.shift().cummax())
print(df)
start end group notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00 G1 NaT
1 2022-08-02 07:25:00 2022-08-02 07:35:00 G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00 G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00 G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00 G2 NaT
5 2022-08-02 08:25:00 2022-08-02 08:55:00 G2 2022-08-02 08:30:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00 G2 2022-08-02 08:55:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00 G2 2022-08-02 08:55:00
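As an alternative to the lambda, you could group the shifted Series a second time before calling cummax. The sketch below is only an illustration on a trimmed-down copy of the question's data (just the 'group' and 'end' columns); the variable names shifted, spilled and regrouped are made up for this example.

```python
import pandas as pd

# Trimmed-down version of the question's data: only 'group' and 'end'.
df = pd.DataFrame({
    "group": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"],
    "end": pd.to_datetime([
        "2022-08-02 07:30", "2022-08-02 07:35",
        "2022-08-02 12:34", "2022-08-02 07:40",
        "2022-08-02 08:30", "2022-08-02 08:55",
        "2022-08-02 08:34", "2022-08-02 08:40",
    ]),
})

# The shift is done per group, but the result is an ordinary Series.
shifted = df.groupby("group")["end"].shift()
print(type(shifted).__name__)  # Series, not SeriesGroupBy

# cummax on the plain Series ignores the groups, so G1's maximum leaks into G2.
spilled = shifted.cummax()

# Grouping the shifted Series again restarts cummax for every group.
regrouped = shifted.groupby(df["group"]).cummax()

print(pd.DataFrame({"spilled": spilled, "regrouped": regrouped}))
```

Both this and the transform version should give the same column; re-grouping avoids a Python-level lambda, which may matter on larger frames.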