Pandas groupby and shift spilling between groups

I'm having an issue with Pandas where combining groupby and shift seems to spill data between the groups.

Here's a reproducible example:

import pandas as pd
from pandas import Timestamp

sample = {'start': {0: Timestamp('2022-08-02 07:20:00'),
      1: Timestamp('2022-08-02 07:25:00'),
      2: Timestamp('2022-08-02 07:26:00'),
      3: Timestamp('2022-08-02 07:35:00'),
      4: Timestamp('2022-08-02 08:20:00'),
      5: Timestamp('2022-08-02 08:25:00'),
      6: Timestamp('2022-08-02 08:26:00'),
      7: Timestamp('2022-08-02 08:35:00')},
     'end': {0: Timestamp('2022-08-02 07:30:00'),
      1: Timestamp('2022-08-02 07:35:00'),
      2: Timestamp('2022-08-02 12:34:00'),
      3: Timestamp('2022-08-02 07:40:00'),
      4: Timestamp('2022-08-02 08:30:00'),
      5: Timestamp('2022-08-02 08:55:00'),
      6: Timestamp('2022-08-02 08:34:00'),
      7: Timestamp('2022-08-02 08:40:00')},
     'group': {0: 'G1',
      1: 'G1',
      2: 'G1',
      3: 'G1',
      4: 'G2',
      5: 'G2',
      6: 'G2',
      7: 'G2'}}

df = pd.DataFrame(sample)
df = df.sort_values('start')

df['notworking'] = df.groupby('group')['end'].shift().cummax()

This gives the following output:

    start               end                 group   notworking
0   2022-08-02 07:20:00 2022-08-02 07:30:00 G1  
1   2022-08-02 07:25:00 2022-08-02 07:35:00 G1      2022-08-02 07:30:00
2   2022-08-02 07:26:00 2022-08-02 12:34:00 G1      2022-08-02 07:35:00
3   2022-08-02 07:35:00 2022-08-02 07:40:00 G1      2022-08-02 12:34:00
4   2022-08-02 08:20:00 2022-08-02 08:30:00 G2  
5   2022-08-02 08:25:00 2022-08-02 08:55:00 G2      2022-08-02 12:34:00
6   2022-08-02 08:26:00 2022-08-02 08:34:00 G2      2022-08-02 12:34:00
7   2022-08-02 08:35:00 2022-08-02 08:40:00 G2      2022-08-02 12:34:00

The 'end' at index 2 is correctly assigned to 'notworking' at index 3, but this value then carries over into the next group.

My desired outcome is for cummax() to start fresh for each group, like this:

    start               end                 group   notworking
0   2022-08-02 07:20:00 2022-08-02 07:30:00 G1  
1   2022-08-02 07:25:00 2022-08-02 07:35:00 G1      2022-08-02 07:30:00
2   2022-08-02 07:26:00 2022-08-02 12:34:00 G1      2022-08-02 07:35:00
3   2022-08-02 07:35:00 2022-08-02 07:40:00 G1      2022-08-02 12:34:00
4   2022-08-02 08:20:00 2022-08-02 08:30:00 G2  
5   2022-08-02 08:25:00 2022-08-02 08:55:00 G2      2022-08-02 08:30:00
6   2022-08-02 08:26:00 2022-08-02 08:34:00 G2      2022-08-02 08:55:00
7   2022-08-02 08:35:00 2022-08-02 08:40:00 G2      2022-08-02 08:55:00

I guess this is a simple user error. Does anyone know a fix for this?

groupby.shift returns a plain Series, so the chained cummax runs over that whole Series rather than over the SeriesGroupBy you wanted. You can try groupby.transform, which applies both steps within each group:

df['notworking'] = df.groupby('group')['end'].transform(lambda col: col.shift().cummax())
print(df)

                start                 end group          notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00    G1                 NaT
1 2022-08-02 07:25:00 2022-08-02 07:35:00    G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00    G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00    G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00    G2                 NaT
5 2022-08-02 08:25:00 2022-08-02 08:55:00    G2 2022-08-02 08:30:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00    G2 2022-08-02 08:55:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00    G2 2022-08-02 08:55:00
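
As an alternative sketch that avoids the lambda, the shifted Series can be grouped again by the same key before calling cummax. This is not part of the original answer; the variable names (shifted, spilled, regrouped, via_transform) are made up for illustration, and the frame is a trimmed copy of the question's 'end' and 'group' columns. It assumes the rows are already in the desired order, as they are after sort_values('start') in the question.

```python
import pandas as pd

# Trimmed copy of the question's data: only the columns the fix needs.
df = pd.DataFrame({
    "end": pd.to_datetime([
        "2022-08-02 07:30", "2022-08-02 07:35",
        "2022-08-02 12:34", "2022-08-02 07:40",
        "2022-08-02 08:30", "2022-08-02 08:55",
        "2022-08-02 08:34", "2022-08-02 08:40",
    ]),
    "group": ["G1"] * 4 + ["G2"] * 4,
})

# groupby(...).shift() hands back an ordinary Series; the grouping is gone.
shifted = df.groupby("group")["end"].shift()
is_plain_series = isinstance(shifted, pd.Series)

# Calling cummax on that Series accumulates across the group boundary.
spilled = shifted.cummax()

# Grouping the shifted values by the same key restarts cummax per group.
regrouped = shifted.groupby(df["group"]).cummax()

# The transform approach from the answer, for comparison.
via_transform = df.groupby("group")["end"].transform(
    lambda col: col.shift().cummax()
)

print(regrouped)
```

Both regrouped and via_transform should hold the same per-group values; the re-grouping form mainly trades the Python-level lambda for a second groupby call.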


 