
Pandas DataFrame: get all rows between zeros (0) in the mask column, then the first and last row of each group

I have a dataframe like this.

   store daiban  signal  ...          start_time            end_time mask
0   0901   0001       0  ... 2020-03-31 00:00:00 2020-03-31 00:35:00    0
1   0901   0001       1  ... 2020-03-31 00:35:00 2020-03-31 00:36:40    1
2   0901   0001       2  ... 2020-03-31 00:36:40 2020-03-31 00:38:44    1
3   0901   0001       0  ... 2020-03-31 00:38:44 2020-03-31 01:10:40    0
4   0901   0001       1  ... 2020-03-31 01:10:40 2020-03-31 01:12:24    1
5   0901   0001       2  ... 2020-03-31 01:12:24 2020-03-31 01:13:40    1
6   0901   0001       1  ... 2020-03-31 01:13:40 2020-03-31 01:15:04    1
7   0901   0001       2  ... 2020-03-31 01:15:04 2020-03-31 01:17:00    1
8   0901   0001       0  ... 2020-03-31 01:17:00 2020-03-31 02:33:04    0
9   0901   0001       1  ... 2020-03-31 02:33:04 2020-03-31 02:34:52    1
10  0901   0001       2  ... 2020-03-31 02:34:52 2020-03-31 02:37:28    1

I want to get all rows that lie between the zeros (0) in the mask column and, for each such group, take the first row's start_time and the last row's end_time.

For example:

1) The first group is the rows at index 1 to 2.

1   0901   0001       1  ... 2020-03-31 00:35:00 2020-03-31 00:36:40    1
2   0901   0001       2  ... 2020-03-31 00:36:40 2020-03-31 00:38:44    1

2) Take the first row's start_time and the last row's end_time of each group.

0   0901   0001     2020-03-31 00:35:00  2020-03-31 00:38:44    

Expected output:

   store daiban        start_time            end_time 
0   0901   0001     2020-03-31 00:35:00  2020-03-31 00:38:44    
1   0901   0001     2020-03-31 01:10:40  2020-03-31 01:17:00
2   0901   0001     2020-03-31 02:33:04  2020-03-31 02:37:28

DataFrame to reproduce the example:

import pandas as pd
from pandas import Timestamp
df = pd.DataFrame.from_dict({'store': {0: '0901',
  1: '0901',
  2: '0901',
  3: '0901',
  4: '0901',
  5: '0901',
  6: '0901',
  7: '0901',
  8: '0901',
  9: '0901',
  10: '0901'},
 'daiban': {0: '0001',
  1: '0001',
  2: '0001',
  3: '0001',
  4: '0001',
  5: '0001',
  6: '0001',
  7: '0001',
  8: '0001',
  9: '0001',
  10: '0001'},
 'signal': {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2, 6: 1, 7: 2, 8: 0, 9: 1, 10: 2},
 'cum_sum': {0: 525,
  1: 25,
  2: 31,
  3: 479,
  4: 26,
  5: 19,
  6: 21,
  7: 29,
  8: 1141,
  9: 27,
  10: 39},
 'seconds': {0: 2100,
  1: 100,
  2: 124,
  3: 1916,
  4: 104,
  5: 76,
  6: 84,
  7: 116,
  8: 4564,
  9: 108,
  10: 156},
 'start_time': {0: Timestamp('2020-03-31 00:00:00'),
  1: Timestamp('2020-03-31 00:35:00'),
  2: Timestamp('2020-03-31 00:36:40'),
  3: Timestamp('2020-03-31 00:38:44'),
  4: Timestamp('2020-03-31 01:10:40'),
  5: Timestamp('2020-03-31 01:12:24'),
  6: Timestamp('2020-03-31 01:13:40'),
  7: Timestamp('2020-03-31 01:15:04'),
  8: Timestamp('2020-03-31 01:17:00'),
  9: Timestamp('2020-03-31 02:33:04'),
  10: Timestamp('2020-03-31 02:34:52')},
 'end_time': {0: Timestamp('2020-03-31 00:35:00'),
  1: Timestamp('2020-03-31 00:36:40'),
  2: Timestamp('2020-03-31 00:38:44'),
  3: Timestamp('2020-03-31 01:10:40'),
  4: Timestamp('2020-03-31 01:12:24'),
  5: Timestamp('2020-03-31 01:13:40'),
  6: Timestamp('2020-03-31 01:15:04'),
  7: Timestamp('2020-03-31 01:17:00'),
  8: Timestamp('2020-03-31 02:33:04'),
  9: Timestamp('2020-03-31 02:34:52'),
  10: Timestamp('2020-03-31 02:37:28')},
 'mask': {0: 0, 1: 1, 2: 1, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1, 10: 1}})

IIUC, we can use cumsum to build a group key, filter the DataFrame down to the rows where mask is nonzero, and then use agg:

df = (
    df.loc[df['mask'].ne(0)]
      # every 0 in mask increments the counter, so each run of 1s gets its own key
      .groupby([df['mask'].eq(0).cumsum(), df.store, df.daiban])
      .agg({'start_time': 'first', 'end_time': 'last'})
      .reset_index()
)

Output (the mask column here holds the group counter, not the original mask values):

   mask store daiban          start_time            end_time
0     1  0901   0001 2020-03-31 00:35:00 2020-03-31 00:38:44
1     2  0901   0001 2020-03-31 01:10:40 2020-03-31 01:17:00
2     3  0901   0001 2020-03-31 02:33:04 2020-03-31 02:37:28
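
The answer above leaves the group counter (named mask) in the result, while the expected output in the question has only store, daiban, start_time and end_time. A minimal sketch of one way to drop it, assuming a cut-down stand-in for the question's frame; the name block_id is hypothetical and only there to keep the counter from being confused with the original mask column:

```python
import pandas as pd

# Cut-down stand-in for the question's frame (only the columns needed here).
df = pd.DataFrame({
    "store": ["0901"] * 6,
    "daiban": ["0001"] * 6,
    "start_time": pd.to_datetime([
        "2020-03-31 00:00:00", "2020-03-31 00:35:00", "2020-03-31 00:36:40",
        "2020-03-31 00:38:44", "2020-03-31 01:10:40", "2020-03-31 01:12:24",
    ]),
    "end_time": pd.to_datetime([
        "2020-03-31 00:35:00", "2020-03-31 00:36:40", "2020-03-31 00:38:44",
        "2020-03-31 01:10:40", "2020-03-31 01:12:24", "2020-03-31 01:13:40",
    ]),
    "mask": [0, 1, 1, 0, 1, 1],
})

# Running count of zeros: every run of 1s shares the count of the zero before it.
block_id = df["mask"].eq(0).cumsum().rename("block_id")  # hypothetical name

result = (
    df.loc[df["mask"].ne(0)]                       # keep only the rows inside a run
      .groupby([block_id, "store", "daiban"])      # Series key is aligned on the index
      .agg({"start_time": "first", "end_time": "last"})
      .reset_index()
      .drop(columns="block_id")                    # the counter is no longer needed
)
print(result)
```

Renaming the key is optional; it only makes it explicit that the grouped column is a counter rather than the 0/1 mask.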

The key here is to create a group column that labels each run of 1s in mask. This can be done with cumsum:

df['group'] = (1-df['mask']).cumsum()
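
To see why this labels the runs, it may help to look at the mask values from the question on their own. The sketch below assumes mask contains only 0 and 1, so that 1 - mask is 1 exactly on the zero rows:

```python
import pandas as pd

# The mask column from the question's example frame.
mask = pd.Series([0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1], name="mask")

# 1 - mask flags the zero rows; the running sum goes up by one at each of them.
group = (1 - mask).cumsum()

print(group.tolist())
# [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
```

Each zero row carries the same label as the run of 1s that follows it, so the zero rows still have to be filtered out. Two zeros in a row would simply skip a label, which is harmless because the labels only need to be distinct.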

Then keep only the rows where mask is 1:

df_filter = df.loc[df['mask']==1]

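You can then group by group, store and daiban to get one row per group, taking the minimum start_time and the maximum end_time in each group (the same as the first and last values here, because the rows are in chronological order):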

df_filter.groupby(['group','store','daiban'])\
         .agg({'start_time': 'min', 'end_time': 'max'})\
         .reset_index()\
         .drop('group',axis=1)
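
The three steps can also be put together in one place. The following is a sketch only: the helper name summarize_blocks and the small sample frame are hypothetical, and it assumes mask holds only 0 and 1. It uses assign so that the caller's frame is not modified.

```python
import pandas as pd


def summarize_blocks(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse each run of mask == 1 rows into one row (hypothetical helper)."""
    # Step 1: label the runs; the counter goes up by one at every mask == 0 row.
    labelled = df.assign(group=(1 - df["mask"]).cumsum())
    # Step 2: drop the separator rows.
    kept = labelled.loc[labelled["mask"] == 1]
    # Step 3: one row per run, with the earliest start and the latest end.
    return (
        kept.groupby(["group", "store", "daiban"])
            .agg({"start_time": "min", "end_time": "max"})
            .reset_index()
            .drop("group", axis=1)
    )


# Hypothetical sample with two runs of 1s separated by zeros.
sample = pd.DataFrame({
    "store": ["0901"] * 5,
    "daiban": ["0001"] * 5,
    "start_time": pd.to_datetime([
        "2020-03-31 00:00:00", "2020-03-31 00:35:00", "2020-03-31 00:36:40",
        "2020-03-31 00:38:44", "2020-03-31 01:10:40",
    ]),
    "end_time": pd.to_datetime([
        "2020-03-31 00:35:00", "2020-03-31 00:36:40", "2020-03-31 00:38:44",
        "2020-03-31 01:10:40", "2020-03-31 01:12:24",
    ]),
    "mask": [0, 1, 1, 0, 1],
})

summary = summarize_blocks(sample)
print(summary)
```

Using min and max instead of first and last makes the result independent of row order within a run, at the cost of no longer meaning "first row" and "last row" if the timestamps are not sorted.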
