GroupBy and aggregate function in Pandas

I have the time series dataset below. I would like to split it into 20-second bins, get the min and max timestamps in each bin, and add a flag to each bin indicating whether it contains at least one successful result (success: result = 0; failed: result = 1).

data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 1},
    {"product": "abc", "test_tstamp": 1530693430, "result": 1},
    {"product": "abc", "test_tstamp": 1530693431, "result": 1}]

I'm able to cut the data into 20-second intervals using pandas.cut() and get the min and max timestamps for each bin:

import numpy as np
import pandas as pd
arange = np.arange(1530693398, 1530693440, 20)
data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 1},
    {"product": "abc", "test_tstamp": 1530693430, "result": 1},
    {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)
df['bins'] = pd.cut(df['test_tstamp'], arange)
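# Nested-dict renaming inside agg() was removed in pandas 1.0, so aggregate first and rename afterwards.
output_1 = (df.groupby("bins")
              .agg({'test_tstamp': ['max', 'min'], 'result': ['count']})
              .rename(columns={'max': 'maxdate', 'min': 'mindate'}))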

                         test_tstamp               result
                         maxdate     mindate       count
bins                                                   
(1530693398, 1530693418]  1530693415  1530693399      5
(1530693418, 1530693438]  1530693431  1530693420      3
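
As a small illustration of the binning step above, the sketch below shows which interval each timestamp falls into. It assumes the same timestamps and edges as the question; the names tstamps, edges and bin_codes are only illustrative.

```python
import numpy as np
import pandas as pd

# Same timestamps and bin edges as in the question.
tstamps = pd.Series([1530693399, 1530693405, 1530693410, 1530693411,
                     1530693415, 1530693420, 1530693430, 1530693431])
edges = np.arange(1530693398, 1530693440, 20)  # 1530693398, 1530693418, 1530693438

# With the default right=True, each interval excludes its left edge and
# includes its right edge, e.g. (1530693398, 1530693418].
bins = pd.cut(tstamps, edges)

# Integer position of the bin each timestamp landed in
# (-1 would mean the value fell outside every bin).
bin_codes = bins.cat.codes.tolist()
print(bin_codes)
```

Because the intervals are closed on the right, a timestamp equal to the first edge (1530693398) would not be assigned to any bin unless include_lowest=True is passed to pd.cut().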

I can also count the successful and failed results in each bin using groupby():

output_2 = df.groupby(["bins", "result"]).result.count()
                                     result
bins                     result
(1530693398, 1530693418] 0            3
                         1            2
(1530693418, 1530693438] 1            3

I'm not sure how to combine output_1 and output_2 so that, instead of the result count column above, each bin has result success, result failed, and flag columns.

Expected Output:

                         test_tstamp              result          flag
                             maxdate     mindate success failed
bins
(1530693398, 1530693418]  1530693415  1530693399       3      2   True
(1530693418, 1530693438]  1530693431  1530693420       0      3  False

Any pointers would help! Thank you!

Unstack output_2, then concatenate the two outputs:

output_2 = (
    output_2
       .unstack(fill_value=0)
       .rename(columns={0 : 'success', 1 : 'failed'}))

df = (pd.concat([output_1.test_tstamp, output_2], axis=1, keys=['test_tstamp', 'result'])
        .assign(flag=output_2.success.gt(0)))

                         test_tstamp              result          flag
result                       mindate     maxdate success failed       
bins                                                                  
(1530693398, 1530693418]  1530693399  1530693415       3      2   True
(1530693418, 1530693438]  1530693420  1530693431       0      3  False
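
For reference, here is a self-contained sketch of the same unstack-and-concatenate idea that should run on pandas 1.0 or later. It is a variation under stated assumptions rather than the exact code above: it uses size() instead of count() for the per-result tallies, passes observed=False explicitly, and the names stamps, counts and combined are only illustrative.

```python
import numpy as np
import pandas as pd

data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
        {"product": "abc", "test_tstamp": 1530693405, "result": 0},
        {"product": "abc", "test_tstamp": 1530693410, "result": 1},
        {"product": "abc", "test_tstamp": 1530693411, "result": 0},
        {"product": "abc", "test_tstamp": 1530693415, "result": 0},
        {"product": "abc", "test_tstamp": 1530693420, "result": 1},
        {"product": "abc", "test_tstamp": 1530693430, "result": 1},
        {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)
df["bins"] = pd.cut(df["test_tstamp"], np.arange(1530693398, 1530693440, 20))

# Min and max timestamp per bin.
stamps = (df.groupby("bins", observed=False)["test_tstamp"]
            .agg(["min", "max"])
            .rename(columns={"min": "mindate", "max": "maxdate"}))

# Rows per (bin, result) pair, pivoted so each result value becomes a column.
counts = (df.groupby(["bins", "result"], observed=False)
            .size()
            .unstack(fill_value=0)
            .rename(columns={0: "success", 1: "failed"}))

# Put both frames side by side under top-level keys, then flag the bins
# that contain at least one successful result.
combined = (pd.concat([stamps, counts], axis=1, keys=["test_tstamp", "result"])
              .assign(flag=counts["success"].gt(0)))
print(combined)
```

The flag column ends up under the key ("flag", "") because the combined frame has two-level columns.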
