I have a Pandas dataframe in the following format:
id name timestamp time_diff <=30min
1 movie3 2009-05-04 18:00:00+00:00 NaN False
1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True
1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False
2 movie7 2009-05-04 09:30:00+00:00 NaN False
2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False
3 movie1 2009-05-04 17:45:00+00:00 NaN False
3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True
3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True
3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True
4 movie1 2009-05-05 12:45:00+00:00 NaN False
5 movie7 2009-05-04 11:00:00+00:00 NaN False
5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True
The data shows the movies watched on a video streaming platform. Id is the user id, name is the name of the movie and timestamp is the timestamp at which the movie started. <30min indicates if the user has started the movie within 30minutes of the previous movie watched.
A movie-session is comprised by one or more movies played by a single user, where each movie has started within 30 minutes of the previous movie start time (Basically a session is defined as consecutive rows in which df['<30min'] == True).
The length of a session is defined as time_stamp of the last consecutive df['<30min'] == True - timestamp of the first True of the session.
How can I find the 3 longest sessions (in minutes) in the data, and the movies played during the sessions?
As a first step, I have tried something like this:
df.groupby((df['<20'] == False).cumsum())['time_diff'].fillna(pd.Timedelta(seconds=0)).cumsum()
But it doesn't work (the cumsum does not reset when df['time_diff']=False), and looks very slow.
Also, I think it would make my life harder when I have to select the longest 3 sessions as I could get multiple values for the same session that could be selected in the longest 3.
Not sure I understood you correctly. If I did then this may work; Coercer timestamp to datetime;
df['timestamp']=pd.to_datetime(df['timestamp'])
filter out the True values which indicate consecutive watch.Groupby id whicle calculating the difference between maximum and minimum time. This is then joined to the main df
df.join(df[df['<=30min']==True].groupby('id')['timestamp'].transform(lambda x:x.max()-x.min()).to_frame().rename(columns={'timestamp':'Max'}))
id name timestamp time_diff <=30min Max
0 1 movie3 2009-05-04 18:00:00+00:00 NaN False NaT
1 1 movie5 2009-05-05 18:15:00+00:00 00:15:00 True 00:00:00
2 1 movie1 2009-05-05 22:00:00+00:00 03:45:00 False NaT
3 2 movie7 2009-05-04 09:30:00+00:00 NaN False NaT
4 2 movie8 2009-05-05 12:00:00+00:00 02:30:00 False NaT
5 3 movie1 2009-05-04 17:45:00+00:00 NaN False NaT
6 3 movie7 2009-05-04 18:15:00+00:00 00:30:00 True 00:45:00
7 3 movie6 2009-05-04 18:30:00+00:00 00:15:00 True 00:45:00
8 3 movie6 2009-05-04 19:00:00+00:00 00:30:00 True 00:45:00
9 4 movie1 2009-05-05 12:45:00+00:00 NaN False NaT
10 5 movie7 2009-05-04 11:00:00+00:00 NaN False NaT
11 5 movie8 2009-05-04 11:15:00+00:00 00:15:00 True 00:00:00
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.