Pandas - groupby columns with conditions from another column

I am struggling with how to group values in pandas when the grouping depends on a condition in another column.

Here is what my data looks like as a pandas DataFrame:

id      trigger     timestamp
1       started     2017-10-01 14:00:1
1       ended       2017-10-04 12:00:1
2       started     2017-10-02 10:00:1
1       started     2017-10-03 11:00:1
2       ended       2017-10-04 12:00:1    
2       started     2017-10-05 15:00:1
1       ended       2017-10-05 16:00:1
2       ended       2017-10-05 17:00:1

My goal is to find the difference in days, hours, or minutes between each start and end timestamp, grouped by id.

My output should look something like this (diff in hours):

id      trigger     timestamp           trigger     timestamp               diff
1       started     2017-10-01 14:00:1  ended       2017-10-04 12:00:1      70
1       started     2017-10-03 11:00:1  ended       2017-10-05 16:00:1      53
2       started     2017-10-02 10:00:1  ended       2017-10-04 12:00:1      50
2       started     2017-10-05 15:00:1  ended       2017-10-05 17:00:1      2

I have tried many options, but I cannot find an efficient solution.

Here is my code so far:

First I tried to split the data into 'started' and 'ended':

df['started'] = df.groupby(['id', 'timestamp'])['trigger'] == 'started'

df['ended'] = df.groupby(['id', 'timestamp'])['trigger'] == 'ended'

and then:

df.groupby(['id', 'started', 'ended'], as_index=True).sum()

but it didn't work. I also tried:

df['started'] = df.groupby(['trigger'])['timestamp'].np.where(df['trigger']=='started')

also without good results.

Can someone point me in the right direction on how to do this with pandas? My data will also contain unmatched events (for example, a 'started' with no 'ended'). How can I use df.fillna(method='ffill'), or otherwise represent the missing data as NaN, in the new dataframe?

  1. Set id and trigger as the index
  2. Since that index contains duplicate entries, add a third index level holding the groupwise cumcount. In total, df must have a MultiIndex with 3 levels
  3. Select timestamp and unstack it, so that the trigger values (started, ended) become columns
  4. Compute the difference between the two columns in hours and assign the result back as a new column

import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])  # only if the column is not already datetime

i = df.groupby(['id', 'trigger']).cumcount()
df.set_index(['id', i, 'trigger']).timestamp.unstack().assign(
       diff=lambda d: d.ended.sub(d.started).dt.total_seconds() / 3600
)

Thanks to piRSquared for the improvement.
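
If you want to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the same steps. The names `occurrence`, `paired`, `elapsed`, `diff_hours` and `diff_minutes` are my own and not part of the original answer, and I have assumed the timestamps are meant to read `:01` for the seconds.

```python
import pandas as pd

# Sample data copied from the question (seconds assumed to be :01).
df = pd.DataFrame({
    "id": [1, 1, 2, 1, 2, 2, 1, 2],
    "trigger": ["started", "ended", "started", "started",
                "ended", "started", "ended", "ended"],
    "timestamp": [
        "2017-10-01 14:00:01", "2017-10-04 12:00:01",
        "2017-10-02 10:00:01", "2017-10-03 11:00:01",
        "2017-10-04 12:00:01", "2017-10-05 15:00:01",
        "2017-10-05 16:00:01", "2017-10-05 17:00:01",
    ],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# cumcount numbers each (id, trigger) occurrence 0, 1, ... in row order,
# so the n-th "started" of an id lines up with its n-th "ended".
occurrence = df.groupby(["id", "trigger"]).cumcount()

paired = (
    df.set_index(["id", occurrence, "trigger"])["timestamp"]
      .unstack()  # moves the innermost level (trigger) into columns
)

# The same timedelta can be expressed in whichever unit is needed.
elapsed = paired["ended"] - paired["started"]
paired["diff_hours"] = elapsed.dt.total_seconds() / 3600
paired["diff_minutes"] = elapsed.dt.total_seconds() / 60
print(paired)
```

Note that the pairing is purely positional: it relies on the rows being in chronological order within each id, so sort by timestamp first if that is not guaranteed.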

Output:

                  timestamp                      diff
trigger               ended             started      
id                                                   
1  0    2017-10-04 12:00:01 2017-10-01 14:00:01  70.0
   1    2017-10-05 16:00:01 2017-10-03 11:00:01  53.0
2  0    2017-10-04 12:00:01 2017-10-02 10:00:01  50.0
   1    2017-10-05 17:00:01 2017-10-05 15:00:01   2.0

The result is not exactly as depicted in your question, but I believe a MultiIndex of columns would be a cleaner way of representing your output instead of two trigger columns.
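
If you do prefer a flat table closer to the one in the question, and want to see what happens with unmatched events, the sketch below may help. The data is hypothetical (id 2 has a second 'started' with no 'ended'), and the names `occurrence`, `wide` and `flat` are mine. As far as I can tell no fillna is needed for this case: unstack leaves the missing end time as NaT and the difference comes out as NaN.

```python
import pandas as pd

# Hypothetical data: id 2 has a second "started" with no matching "ended".
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "trigger": ["started", "ended", "started", "ended", "started"],
    "timestamp": pd.to_datetime([
        "2017-10-01 14:00:01", "2017-10-04 12:00:01",
        "2017-10-02 10:00:01", "2017-10-04 12:00:01",
        "2017-10-05 15:00:01",
    ]),
})

occurrence = df.groupby(["id", "trigger"]).cumcount()
wide = df.set_index(["id", occurrence, "trigger"])["timestamp"].unstack()

flat = (
    wide[["started", "ended"]]          # column order used in the question
    .reset_index(level=1, drop=True)    # drop the helper occurrence counter
    .reset_index()                      # turn id back into a regular column
)
flat.columns.name = None                # remove the leftover "trigger" label

# The unmatched row has ended = NaT, so its diff is NaN.
flat["diff"] = (flat["ended"] - flat["started"]).dt.total_seconds() / 3600
print(flat)
```

Forward-filling the `ended` column would copy the previous end time into the gap, which is probably not what you want here, though that is my reading of the intent rather than something stated in the question.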
