Compare current row value to previous row values
I have login history data from User A for a day. My requirement is that at any point in time User A can have only one valid login. As in the samples below, the user may have logged in successfully multiple times while his first session was still active. Any login that happened during a valid session needs to be flagged as a duplicate.
Example 1:
In the first sample data below, while the user was still logged in from 00:12:38 to 01:00:02 (index 0), there is another login from the user from 00:55:14 to 01:00:02 (index 1). Similarly, if we compare index 2 and index 3, we can see that the record at index 3 is a duplicate login as per the requirement.
start_time end_time
0 00:12:38 01:00:02
1 00:55:14 01:00:02
2 01:00:02 01:32:40
3 01:00:02 01:08:40
4 01:41:22 03:56:23
5 18:58:26 19:16:49
6 20:12:37 20:52:49
7 20:55:16 22:02:50
8 22:21:24 22:48:50
9 23:11:30 00:00:00
Expected output:
start_time end_time isDup
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
These duplicate records need to be set to 1 in the isDup column.
Example 2:
Another sample of data is below. Here, while the user was still logged in between 13:36:10 and 13:50:16, there were 3 additional sessions that also need to be flagged.
start_time end_time
0 13:32:54 13:32:55
1 13:36:10 13:50:16
2 13:37:54 13:38:14
3 13:46:38 13:46:45
4 13:48:59 13:49:05
5 13:50:16 13:50:20
6 14:03:39 14:03:49
7 15:36:20 15:36:20
8 15:46:47 15:46:47
Expected output:
start_time end_time isDup
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
What's an efficient way to compare the start time of the current record with previous records?
Use duplicated() and convert the result to int with astype:
df['isDup'] = (df['start_time'].duplicated(keep=False) | df['end_time'].duplicated(keep=False)).astype(int)
Or did you need:
df['isDup'] = df['start_time'].between(df['start_time'].shift(), df['end_time'].shift()).astype(int)
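The shift()-based check above looks only at the immediately preceding row, so it cannot catch a short login that falls inside a long session several rows earlier (as in Example 2). A minimal sketch of one alternative, which is not from the original answers, compares each start time with the running maximum of all earlier end times. It assumes the rows are already sorted by start_time, that an end_time of 00:00:00 means the end of the day, and that a login starting exactly when the previous session ends is not a duplicate. The helper name flag_duplicates is hypothetical.

```python
import pandas as pd

def flag_duplicates(df):
    """Hypothetical helper: return 1 where a login starts inside an earlier session."""
    # Convert the time-like columns to seconds since midnight.
    start = pd.to_timedelta(df['start_time'].astype(str)).dt.total_seconds()
    end = pd.to_timedelta(df['end_time'].astype(str)).dt.total_seconds()
    # Assumption: an end of 00:00:00 means the session ran until the end of the day.
    end = end.mask(end == 0, 24 * 3600)
    # Latest end time among all earlier rows (NaN for the first row).
    prev_max_end = end.cummax().shift()
    # Comparisons against NaN are False, so the first row is never flagged.
    return (start < prev_max_end).astype(int)

# Example 2 from the question
df = pd.DataFrame({
    'start_time': ['13:32:54', '13:36:10', '13:37:54', '13:46:38', '13:48:59',
                   '13:50:16', '14:03:39', '15:36:20', '15:46:47'],
    'end_time':   ['13:32:55', '13:50:16', '13:38:14', '13:46:45', '13:49:05',
                   '13:50:20', '14:03:49', '15:36:20', '15:46:47'],
})
df['isDup'] = flag_duplicates(df)
print(df['isDup'].tolist())  # [0, 0, 1, 1, 1, 0, 0, 0, 0]
```

This runs in a single pass and needs no row-by-row Python loop. Note that the running maximum also includes the end times of rows already flagged as duplicates; in both samples those end inside the valid session, so the result is unaffected.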
Map the time-like values in the start_time and end_time columns to pandas Timedelta objects, and shift any 00:00:00 value in the end_time column to 23:59:59 (one day minus one second) so that a session ending at midnight does not appear to end before it starts.
import numpy as np
import pandas as pd

c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)
Then, for each pair of start_time and end_time in the dataframe df, mark the corresponding duplicate intervals using numpy broadcasting:
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(axis=1) & ~df[c].duplicated(keep=False)).astype('i1')
# example 1
start_time end_time isDupe
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
# example 2
start_time end_time isDupe
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
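To see what the broadcast mask m in this answer encodes, here is a small illustrative sketch on made-up integer intervals (the values are hypothetical, not taken from the question). Entry m[i, j] is True when interval i lies entirely inside interval j, and clearing the diagonal stops each interval from matching itself.

```python
import numpy as np

# Hypothetical intervals: (10, 50), (20, 25), (30, 40)
s = np.array([10, 20, 30])
e = np.array([50, 25, 40])

# s[:, None] has shape (3, 1) and broadcasts against s with shape (3,),
# giving a (3, 3) matrix that compares every interval with every other one.
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)  # an interval always contains itself, so ignore that

contained = m.any(axis=1)
print(contained.tolist())  # [False, True, True]
```

Because the mask is an n-by-n matrix, memory and time grow quadratically with the number of rows, which is fine for one user's daily logins but may matter for much larger frames.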
Here's my solution to the above question. However, if there is a more efficient way, I would be happy to accept it. Thanks!
def getDuplicate(data):
    data['check_time'] = data.iloc[-1]['start_time']
    data['isDup'] = data.apply(lambda x: 1
                               if (x['start_time'] <= x['check_time']) & (x['check_time'] < x['end_time'])
                               else 0
                               , axis = 1)
    return data['isDup'].sum()

limit = 1
df_copy = df.copy()
df['isDup'] = 0
for i, row in df.iterrows():
    data = df_copy.iloc[:limit]
    isDup = getDuplicate(data)
    limit = limit + 1
    if isDup > 1:
        df.at[i, 'isDup'] = 1
    else:
        df.at[i, 'isDup'] = 0