简体   繁体   中英

Combine Pandas DataFrame Rows by Timestamp and Column

I converted below data into DataFrame which looks as below

data = [
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "a", "value": 1},
       {"start_ts": "2018-05-14 11:54:33", "end_ts": "2018-05-14 12:54:33", "product": "a", "value": 1}, 
       {"start_ts": "2018-05-14 13:54:33", "end_ts": "2018-05-14 14:54:33", "product": "a", "value": 1},          
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "b", "value": 1}
   ]

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 11:54:33 1
1   a       2018-05-14 11:54:33 2018-05-14 12:54:33 1
2   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
3   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm trying to bucketize above DF rows into one row by finding contiguous timestamp fields (where start_ts is equal to the prior row's end_ts for a product) for a product and sum the value column like below.

Expected:

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 12:54:33 2
1   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
2   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm unable to get the expected above value using the code below

def merge_dates(grp):
    date_groups = (grp['start_ts'] != grp['end_ts'].shift())
    return grp.groupby(date_groups).agg({'start_ts': 'first', 'end_ts': 'last'})   

df.groupby(["product"]).apply(merge_dates)

Need some advice. Any help would be greatly appreciated!

Thanks

I believe this will work:

df.groupby(['product', (df.start_ts != df.end_ts.shift()).cumsum()], \
           as_index=False).agg({'start_ts':min, 'end_ts':max, 'value':sum})

#   product              end_ts            start_ts  value
# 0       a 2018-05-14 12:54:33 2018-05-14 10:54:33      2
# 1       a 2018-05-14 14:54:33 2018-05-14 13:54:33      1
# 2       b 2018-05-14 11:54:33 2018-05-14 10:54:33      1

This approach groups by product and by a cumsum of the boolean series created by df.start_ts != df.end_ts.shift() ; the boolean series serves as a counter that increases by one each time df.start_ts does not equal the previous row's end_ts (ie df.end_ts.shift() ), thus indicating when a new group should start.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM