Split login session into shift buckets
I have a table of user logins and logouts.
The table looks like this, but with a few hundred thousand rows:
import pandas as pd

data = [['aa', '2020-05-31 00:00:01', '2020-05-31 00:00:31'],
        ['bb', '2020-05-31 00:01:01', '2020-05-31 00:02:01'],
        ['aa', '2020-05-31 00:02:01', '2020-05-31 00:06:03'],
        ['cc', '2020-05-31 00:03:01', '2020-05-31 00:04:01'],
        ['dd', '2020-05-31 00:04:01', '2020-05-31 00:34:01'],
        ['aa', '2020-05-31 00:05:01', '2020-05-31 00:07:31'],
        ['bb', '2020-05-31 00:05:01', '2020-05-31 00:06:01'],
        ['aa', '2020-05-31 22:05:01', '2020-06-01 09:08:03'],
        ['cc', '2020-05-31 22:10:01', '2020-06-01 09:40:01'],
        ['dd', '2020-05-31 00:20:01', '2020-05-31 15:35:01']]
df_test = pd.DataFrame(data, columns=['user_id', 'login', 'logout'])
# convert only the timestamp columns; user_id stays a string
df_test[['login', 'logout']] = df_test[['login', 'logout']].apply(pd.to_datetime)
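I need to know how much time each session spent in 4 different shifts:
night (12 am to 6 am), morning (6 am to 12 pm), afternoon (12 pm to 6 pm), evening (6 pm to 12 am).
I was able to solve this (code below), but some sessions span multiple days, and if a session starts at 10 pm and ends the next morning at 9 am, my script does not allocate the time properly.
I am not sure whether there is a proper algorithm for this kind of problem in Python.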
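To make the overnight case concrete, here is a minimal sketch that clips one session against the four shift windows of its login day only. The helper `overlap_hours` and the variable names are hypothetical and not part of the posted code; the session times come from the sample data, assuming the logout falls on 2020-06-01.

```python
from datetime import datetime, timedelta


def overlap_hours(start, end, win_start, win_end):
    """Hours of the interval [start, end) that fall inside [win_start, win_end)."""
    seconds = (min(end, win_end) - max(start, win_start)).total_seconds()
    return max(seconds, 0.0) / 3600


# One overnight session (logout date is an assumption, see the text above).
login = datetime(2020, 5, 31, 22, 5, 1)
logout = datetime(2020, 6, 1, 9, 8, 3)

# Shift windows built from the login day only.
day = login.replace(hour=0, minute=0, second=0, microsecond=0)
edges = [day + timedelta(hours=h) for h in (0, 6, 12, 18, 24)]
names = ["night", "morning", "afternoon", "evening"]

allocated = {
    name: overlap_hours(login, logout, edges[i], edges[i + 1])
    for i, name in enumerate(names)
}

# Time that no window of the login day can account for.
session_hours = (logout - login).total_seconds() / 3600
missing_hours = session_hours - sum(allocated.values())

print(allocated)
print(round(missing_hours, 2))
```

Only the roughly 1.9 hours before midnight land in a shift (evening). The remaining 9.1 hours or so after midnight fall outside every window built from the login day, which is the time that gets lost.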
Here is my code:
shifting = df_test.copy()
# extracting day from each datetime. We will use it to dynamically create shifts for each loop iteration
shifting['day'] = shifting['login'].dt.floor("D")
# adding 4 empty columns to the data, 1 for each shift
shifting['night'] = ''
shifting['morning'] = ''
shifting['afternoon'] = ''
shifting['evening'] = ''
# writing logic to properly split time between shifts if needed
def time_in_shift(start, end, shift_start, shift_end):
    """
    Splits a session's time between shifts if needed.
    The logic is as follows: if the user logs in before the shift's start time -> the shift's start time takes the place of the login time.
    If the user logs out after the shift's end time -> the shift's end time takes the place of the logout time. This logic is not perfect, as sessions can span
    multiple days. This function accounts for that by splitting the time equally in 4 if a session is longer than 24h. Need a bit more time to figure out the rest.
    Args:
        start (datetime): login timestamp.
        end (datetime): logout timestamp.
        shift_start (datetime): start time of a shift.
        shift_end (datetime): end time of a shift.
    Returns:
        hours spent in the given shift (numeric)
    """
    # first condition: if the session is longer than 24h -> split evenly between 4 shifts
    if (end - start).total_seconds()/3600 > 24:
        return (end - start).total_seconds()/3600/4
    # if not -> follow the logic outlined in the description of this function
    else:
        if start < shift_start:
            start = shift_start
        if end > shift_end:
            end = shift_end
        # calculating time spent in the session here (in hours)
        time_spent = (end-start).total_seconds()/3600
        # negative hours means that no time was spent in that shift -> turn to 0
        if time_spent < 0:
            time_spent = 0
        return time_spent
# applying the time_in_shift function to each row of the connections dataset (now shifting)
shift_names = ['night', 'morning', 'afternoon', 'evening']
for i in shifting.index:
    # dynamically creating shifts for each session. Must be done because dates are always different.
    day = shifting.loc[i, 'day']
    shift_start = (day,
                   day + pd.Timedelta(hours=6),
                   day + pd.Timedelta(hours=12),
                   day + pd.Timedelta(hours=18))
    shift_end = (shift_start[1],
                 shift_start[2],
                 shift_start[3],
                 shift_start[0] + pd.Timedelta(days=1))
    # range here corresponds to 4 shifts
    for shift in range(4):
        # storing time in the shift_time variable
        shift_time = time_in_shift(shifting.loc[i, 'login'], shifting.loc[i, 'logout'], shift_start[shift], shift_end[shift])
        # assumed final step (not in the snippet as posted): write the result to the matching shift column
        shifting.loc[i, shift_names[shift]] = shift_time
If you know how to do this better, please let me know. Thanks in advance!
If I understand correctly, you are trying to bin the session times by shift?
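df = pd.DataFrame(data, columns=["user_id", "login", "logout"])
df[["login", "logout"]] = df[["login", "logout"]].apply(pd.to_datetime)
# total_seconds() keeps the day component of the timedelta; .dt.seconds would drop it
df["delta_hours"] = (df["logout"] - df["login"]).dt.total_seconds() / 3600
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]
# each session is assigned in full to the shift in which its login hour falls
shift = pd.cut(df["login"].dt.hour, bins=bins, labels=labels, right=False)
df = (
    df
    .groupby(["user_id", shift], observed=False)["delta_hours"]
    .sum()
    .unstack()
    .rename_axis(None, axis=1)
    .reset_index()
)
print(df)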
  user_id      night  morning  afternoon    evening
0      aa   0.117222      0.0        0.0  11.050556
1      bb   0.033333      0.0        0.0   0.000000
2      cc   0.016667      0.0        0.0  11.500000
3      dd  15.750000      0.0        0.0   0.000000
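Note that the grouping above assigns each session's full duration to the shift in which the login falls, so an overnight session is counted entirely as evening. A minimal sketch of one way to split each session at the 6-hour shift boundaries instead, across any number of days, could look like the following. The helper `split_session` and the names `SHIFT_NAMES`, `sessions`, `per_session` and `per_user` are illustrative and not from the posts above; the two sample rows are taken from the question's data, assuming the overnight logout falls on 2020-06-01.

```python
import pandas as pd

# Shift labels in clock order: 0-6, 6-12, 12-18, 18-24.
SHIFT_NAMES = ["night", "morning", "afternoon", "evening"]


def split_session(login, logout):
    """Return hours spent in each shift for one session, over any number of days."""
    hours = dict.fromkeys(SHIFT_NAMES, 0.0)
    # Start of the 6-hour block that contains the login timestamp.
    block_start = login.floor("D") + pd.Timedelta(hours=6 * (login.hour // 6))
    # Walk forward block by block until the logout is passed.
    while block_start < logout:
        block_end = block_start + pd.Timedelta(hours=6)
        overlap = min(logout, block_end) - max(login, block_start)
        name = SHIFT_NAMES[block_start.hour // 6]
        hours[name] += max(overlap.total_seconds(), 0.0) / 3600
        block_start = block_end
    return hours


sessions = pd.DataFrame(
    {
        "user_id": ["aa", "dd"],
        "login": pd.to_datetime(["2020-05-31 22:05:01", "2020-05-31 00:20:01"]),
        "logout": pd.to_datetime(["2020-06-01 09:08:03", "2020-05-31 15:35:01"]),
    }
)

# One row of shift hours per session, aligned with the original index.
parts = pd.DataFrame(
    [split_session(a, b) for a, b in zip(sessions["login"], sessions["logout"])],
    index=sessions.index,
)
per_session = sessions.join(parts)

# Total hours per user and shift.
per_user = per_session.groupby("user_id")[SHIFT_NAMES].sum()
print(per_user)
```

This walks each session in a plain Python loop, so it is not vectorized. Each session only visits the 6-hour blocks it actually touches, which keeps the work small for short sessions, but it has not been benchmarked on a table of several hundred thousand rows.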