
Sum timedeltas between specific pairs of rows in a pandas dataframe

I have been wrestling with this for a while and can't figure it out.

I've got some logs of user actions recorded while users watch a live broadcast on our product, and I need the total time each user spent watching the broadcast, excluding any time they had the stream paused.

My dataframe looks like this (after some filtering):

                dateHourMinute             event      user
2    2020-05-01 14:35:00+01:00              play  clqj9026
5811 2020-05-01 14:45:00+01:00             pause  clqj9026 # -- exclude this
5812 2020-05-01 15:00:00+01:00              play  clqj9026 # -- timedelta
5846 2020-05-01 15:01:00+01:00              play  clqj9026
6147 2020-05-01 15:07:00+01:00             pause  clqj9026
6148 2020-05-01 15:07:00+01:00              play  clqj9026
6354 2020-05-01 15:20:00+01:00             pause  clqj9026
6355 2020-05-01 15:20:00+01:00              play  clqj9026
6392 2020-05-01 15:21:00+01:00              play  clqj9026
6505 2020-05-01 15:23:00+01:00             pause  clqj9026
6506 2020-05-01 15:23:00+01:00  stopped_watching  clqj9026

I want to sum the timedeltas between each pair of 'play'/'pause' events, but avoid including the gaps between a 'pause' and the following 'play', on the assumption that the user had the stream closed during that time.

The example shows contiguous events, but we have to assume there are instances where the stream was paused and the user was doing something else. Also, I need to disregard instances of the same event occurring twice in sequence. I know I can do df.dateHourMinute.diff().sum(), but this doesn't take into account the periods when the stream was paused.

Secondly, is there a way to do this without iterating over the unique values in the user column to get the total viewing time per user?

EDIT: Changed the table above to show a gap where the stream was paused. To clarify, the total view time for the table above should come out at 33 minutes. (Note the period between the first 'pause' at 14:45 and the second 'play' event at 15:00; I want to exclude that time period.)

Try this:

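import pandas as pd

# Parse timestamps, then sort by time (mergesort is stable, so tied rows keep log order)
df['dateHourMinute'] = pd.to_datetime(df['dateHourMinute'])
df = df.sort_values('dateHourMinute', kind='mergesort')
# Time from each event to the next one
df['time_diff'] = df['dateHourMinute'].shift(-1) - df['dateHourMinute']
# Keep only the intervals that start with a 'play' event, then total them
df = df[df['event']=='play']
print(df['time_diff'].sum())

pd.to_datetime converts dateHourMinute to datetime, and sort_values orders the rows by time. The shift(-1) line computes the time from each row to the next one. Filtering to the 'play' rows keeps only the intervals that begin with a play event, so any time between a pause and the next play is dropped, while a repeated play simply continues the same viewing period. The last line adds up the remaining time_diff values, which for the table above is 0 days 00:33:00, matching the expected 33 minutes. Note that this treats the whole dataframe as a single user's log.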
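For the second part of the question (per-user totals without looping over users), the same idea can probably be applied inside a groupby. The following is only a sketch: it rebuilds the question's table by hand, and the second user `user_b` and its two rows are made up purely to show the grouping.

```python
import pandas as pd

# The question's table for clqj9026, plus a hypothetical second user (user_b).
rows = [
    ("2020-05-01 14:35:00+01:00", "play", "clqj9026"),
    ("2020-05-01 14:45:00+01:00", "pause", "clqj9026"),
    ("2020-05-01 15:00:00+01:00", "play", "clqj9026"),
    ("2020-05-01 15:01:00+01:00", "play", "clqj9026"),
    ("2020-05-01 15:07:00+01:00", "pause", "clqj9026"),
    ("2020-05-01 15:07:00+01:00", "play", "clqj9026"),
    ("2020-05-01 15:20:00+01:00", "pause", "clqj9026"),
    ("2020-05-01 15:20:00+01:00", "play", "clqj9026"),
    ("2020-05-01 15:21:00+01:00", "play", "clqj9026"),
    ("2020-05-01 15:23:00+01:00", "pause", "clqj9026"),
    ("2020-05-01 15:23:00+01:00", "stopped_watching", "clqj9026"),
    ("2020-05-01 14:40:00+01:00", "play", "user_b"),   # made-up row
    ("2020-05-01 14:50:00+01:00", "pause", "user_b"),  # made-up row
]
df = pd.DataFrame(rows, columns=["dateHourMinute", "event", "user"])
df["dateHourMinute"] = pd.to_datetime(df["dateHourMinute"])

# Stable sort so rows sharing a timestamp keep their original log order.
df = df.sort_values("dateHourMinute", kind="mergesort")

# Time from each event to the same user's next event (NaT for a user's last row).
next_time = df.groupby("user")["dateHourMinute"].shift(-1)
df["time_diff"] = next_time - df["dateHourMinute"]

# Only intervals that start with 'play' count as viewing time.
watch_time = df.loc[df["event"] == "play"].groupby("user")["time_diff"].sum()
print(watch_time)
```

With this data, clqj9026 comes out at 33 minutes and the made-up user_b at 10 minutes. Shifting within each user's group keeps one user's last event from being paired with another user's first event.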

Let me know if it is helpful.
