Cumulative sum of Timedelta column based on boolean condition

I have a Pandas dataframe in the following format:

id  name   timestamp                   time_diff <=30min
1   movie3 2009-05-04 18:00:00+00:00        NaN  False
1   movie5 2009-05-05 18:15:00+00:00   00:15:00  True
1   movie1 2009-05-05 22:00:00+00:00   03:45:00  False
2   movie7 2009-05-04 09:30:00+00:00        NaN  False
2   movie8 2009-05-05 12:00:00+00:00   02:30:00  False
3   movie1 2009-05-04 17:45:00+00:00        NaN  False
3   movie7 2009-05-04 18:15:00+00:00   00:30:00  True
3   movie6 2009-05-04 18:30:00+00:00   00:15:00  True
3   movie6 2009-05-04 19:00:00+00:00   00:30:00  True
4   movie1 2009-05-05 12:45:00+00:00        NaN  False
5   movie7 2009-05-04 11:00:00+00:00        NaN  False
5   movie8 2009-05-04 11:15:00+00:00   00:15:00  True

The data shows the movies watched on a video streaming platform. id is the user id, name is the name of the movie, and timestamp is the time at which the movie started. <=30min indicates whether the user started the movie within 30 minutes of the previous movie they watched.
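For reference, the two derived columns could be produced along the following lines. This is only a sketch under assumptions: the three-row frame is made-up sample data, and it assumes time_diff is the gap to the same user's previous movie start.

```python
import pandas as pd

# Hypothetical three-row sample; the values are illustrative only.
df = pd.DataFrame({
    "id": [3, 3, 3],
    "name": ["movie1", "movie7", "movie6"],
    "timestamp": pd.to_datetime(
        ["2009-05-04 17:45:00", "2009-05-04 18:15:00", "2009-05-04 19:30:00"],
        utc=True,
    ),
})

# Sort so that diff() compares each start time with the same user's previous one.
df = df.sort_values(["id", "timestamp"]).reset_index(drop=True)

# Time since the same user's previous movie; NaT for each user's first movie.
df["time_diff"] = df.groupby("id")["timestamp"].diff()

# Comparisons against NaT evaluate to False, so first movies are flagged False.
df["<=30min"] = df["time_diff"] <= pd.Timedelta(minutes=30)
```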

A movie session consists of one or more movies played by a single user, where each movie started within 30 minutes of the previous movie's start time (basically, a session is a run of consecutive rows in which df['<=30min'] == True).

The length of a session is defined as the timestamp of the last consecutive df['<=30min'] == True row minus the timestamp of the first True row of the session.

How can I find the 3 longest sessions (in minutes) in the data, and the movies played during the sessions?

As a first step, I have tried something like this:

df.groupby((df['<=30min'] == False).cumsum())['time_diff'].fillna(pd.Timedelta(seconds=0)).cumsum()

But it doesn't work (the cumsum does not reset when df['<=30min'] == False), and it looks very slow.

Also, I think this approach would make it harder to select the 3 longest sessions, since a running sum yields several values per session and more than one of them could end up in the top 3.

I'm not sure I understood you correctly, but if I did, this may work. First, coerce timestamp to datetime:

df['timestamp']=pd.to_datetime(df['timestamp'])

Filter for the True rows, which indicate consecutive watching. Group by id and compute the difference between the maximum and minimum timestamp, then join the result back to the main df:

df.join(
    df[df['<=30min'] == True]
    .groupby('id')['timestamp']
    .transform(lambda x: x.max() - x.min())
    .to_frame()
    .rename(columns={'timestamp': 'Max'})
)

    id    name                 timestamp time_diff  <=30min      Max
0    1  movie3 2009-05-04 18:00:00+00:00       NaN    False      NaT
1    1  movie5 2009-05-05 18:15:00+00:00  00:15:00     True 00:00:00
2    1  movie1 2009-05-05 22:00:00+00:00  03:45:00    False      NaT
3    2  movie7 2009-05-04 09:30:00+00:00       NaN    False      NaT
4    2  movie8 2009-05-05 12:00:00+00:00  02:30:00    False      NaT
5    3  movie1 2009-05-04 17:45:00+00:00       NaN    False      NaT
6    3  movie7 2009-05-04 18:15:00+00:00  00:30:00     True 00:45:00
7    3  movie6 2009-05-04 18:30:00+00:00  00:15:00     True 00:45:00
8    3  movie6 2009-05-04 19:00:00+00:00  00:30:00     True 00:45:00
9    4  movie1 2009-05-05 12:45:00+00:00       NaN    False      NaT
10   5  movie7 2009-05-04 11:00:00+00:00       NaN    False      NaT
11   5  movie8 2009-05-04 11:15:00+00:00  00:15:00     True 00:00:00
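Note that grouping by id alone merges all of a user's True rows into one span, so a user with two separate sessions would get a single combined value. A possible refinement, sketched below under assumptions, is to give each run of consecutive True rows its own label and aggregate per run. The names run_id, sessions, and top3 are introduced here for illustration, and the frame is rebuilt from the question's table with time_diff omitted.

```python
import pandas as pd

# Rows copied from the question's table (time_diff is not needed here).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
    "name": ["movie3", "movie5", "movie1", "movie7", "movie8", "movie1",
             "movie7", "movie6", "movie6", "movie1", "movie7", "movie8"],
    "timestamp": pd.to_datetime([
        "2009-05-04 18:00", "2009-05-05 18:15", "2009-05-05 22:00",
        "2009-05-04 09:30", "2009-05-05 12:00", "2009-05-04 17:45",
        "2009-05-04 18:15", "2009-05-04 18:30", "2009-05-04 19:00",
        "2009-05-05 12:45", "2009-05-04 11:00", "2009-05-04 11:15",
    ], utc=True),
    "<=30min": [False, True, False, False, False, False,
                True, True, True, False, False, True],
})

flag = df["<=30min"]

# A new run starts wherever the flag or the user id differs from the previous row.
run_id = (flag.ne(flag.shift()) | df["id"].ne(df["id"].shift())).cumsum()

# Keep only the True rows; each remaining run_id value is one session.
sessions = (
    df[flag]
    .groupby(run_id[flag])
    .agg(
        id=("id", "first"),
        start=("timestamp", "min"),
        end=("timestamp", "max"),
        movies=("name", list),
    )
)

# Session length in minutes, following the question's definition
# (last True timestamp minus first True timestamp of the run).
sessions["minutes"] = (sessions["end"] - sessions["start"]).dt.total_seconds() / 60

# One row per session, so the top 3 cannot contain the same session twice.
top3 = sessions.nlargest(3, "minutes")
```

On this sample every user has at most one run, so the longest session is user 3's 45-minute run, the same value as in the Max column above. If the movie that precedes the first True row should also count as part of the session, the grouping key would need to change (for example (~flag).cumsum(), which ties each False row to the True rows that follow it); that is a different definition from the one stated in the question.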
