
Compare current row value to previous row values

I have login history data for User A for one day. My requirement is that User A can have only one valid login at any point in time. As in the samples below, the user may have logged in successfully several more times while the first session was still active. Any login that happens during a valid session needs to be flagged as a duplicate.

Example 1:

In the first sample below, while the user was still logged in from 00:12:38 to 01:00:02 (index 0), there is another login by the user from 00:55:14 to 01:00:02 (index 1).

Similarly, comparing index 2 and index 3 shows that the record at index 3 is a duplicate login as per the requirement.

  start_time  end_time
0   00:12:38  01:00:02
1   00:55:14  01:00:02
2   01:00:02  01:32:40
3   01:00:02  01:08:40
4   01:41:22  03:56:23
5   18:58:26  19:16:49
6   20:12:37  20:52:49
7   20:55:16  22:02:50
8   22:21:24  22:48:50
9   23:11:30  00:00:00

Expected output:

  start_time  end_time   isDup
0   00:12:38  01:00:02       0
1   00:55:14  01:00:02       1
2   01:00:02  01:32:40       0
3   01:00:02  01:08:40       1
4   01:41:22  03:56:23       0
5   18:58:26  19:16:49       0
6   20:12:37  20:52:49       0
7   20:55:16  22:02:50       0
8   22:21:24  22:48:50       0
9   23:11:30  00:00:00       0

These duplicate records need to be set to 1 in the isDup column.


Example 2:

Another data sample is below. Here, while the user was still logged in between 13:36:10 and 13:50:16, there were 3 additional sessions that also need to be flagged.

  start_time  end_time
0   13:32:54  13:32:55
1   13:36:10  13:50:16
2   13:37:54  13:38:14
3   13:46:38  13:46:45
4   13:48:59  13:49:05
5   13:50:16  13:50:20
6   14:03:39  14:03:49
7   15:36:20  15:36:20
8   15:46:47  15:46:47

Expected output:

  start_time    end_time    isDup
0   13:32:54    13:32:55    0
1   13:36:10    13:50:16    0
2   13:37:54    13:38:14    1
3   13:46:38    13:46:45    1
4   13:48:59    13:49:05    1
5   13:50:16    13:50:20    0
6   14:03:39    14:03:49    0
7   15:36:20    15:36:20    0
8   15:46:47    15:46:47    0

What is an efficient way to compare the start time of the current record with the previous records?

Use duplicated() and convert the result to int with astype:

df['isDup'] = (df['start_time'].duplicated(keep=False) | df['end_time'].duplicated(keep=False)).astype(int)

Or, if you need to check whether each start time falls within the previous row's session:

df['isDup'] = df['start_time'].between(df['start_time'].shift(), df['end_time'].shift()).astype(int)
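Note that shift() only looks at the immediately preceding row, and between() includes both endpoints by default. The following is a small sketch on the Example 2 data, not part of the original answer. It assumes the times are first converted with pd.to_timedelta and that a half-open interval (inclusive='left') is wanted, so that a login starting exactly when the previous session ends is not flagged.

```python
import pandas as pd

# Example 2 data from the question.
df = pd.DataFrame({
    'start_time': ['13:32:54', '13:36:10', '13:37:54', '13:46:38', '13:48:59',
                   '13:50:16', '14:03:39', '15:36:20', '15:46:47'],
    'end_time':   ['13:32:55', '13:50:16', '13:38:14', '13:46:45', '13:49:05',
                   '13:50:20', '14:03:49', '15:36:20', '15:46:47'],
})

# Assumption: work on timedeltas so the NaT produced by shift() compares as False.
start = pd.to_timedelta(df['start_time'])
end = pd.to_timedelta(df['end_time'])

# A row is flagged when its start lies in [previous start, previous end).
prev_only = start.between(start.shift(), end.shift(), inclusive='left').astype(int)
print(prev_only.tolist())
```

On this data only index 2 is flagged. Indexes 3 and 4 overlap the session at index 1 rather than the row directly above them, so a previous-row comparison does not catch them.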

Convert the time-like values in the start_time and end_time columns to pandas Timedelta objects, and change any 00:00:00 value in the end_time column to 23:59:59 (one day minus one second) so that a session ending at midnight ends after it starts.

import numpy as np
import pandas as pd

c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
# An end_time of 00:00:00 becomes 23:59:59.
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)

Then, for each pair of start_time and end_time in the dataframe df, mark the duplicate intervals using numpy broadcasting:

# m[i, j] is True when interval i lies entirely inside interval j.
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(axis=1) & ~df[c].duplicated(keep=False)).astype('i1')

# example 1
  start_time  end_time  isDupe
0   00:12:38  01:00:02       0
1   00:55:14  01:00:02       1
2   01:00:02  01:32:40       0
3   01:00:02  01:08:40       1
4   01:41:22  03:56:23       0
5   18:58:26  19:16:49       0
6   20:12:37  20:52:49       0
7   20:55:16  22:02:50       0
8   22:21:24  22:48:50       0
9   23:11:30  00:00:00       0

# example 2
  start_time  end_time  isDupe
0   13:32:54  13:32:55       0
1   13:36:10  13:50:16       0
2   13:37:54  13:38:14       1
3   13:46:38  13:46:45       1
4   13:48:59  13:49:05       1
5   13:50:16  13:50:20       0
6   14:03:39  14:03:49       0
7   15:36:20  15:36:20       0
8   15:46:47  15:46:47       0

Here's my solution to the above question. However, if there is a more efficient way, I would be happy to accept it. Thanks!

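def getDuplicate(data):
    # Start time of the last row in the slice, i.e. the login being tested.
    data['check_time'] = data.iloc[-1]['start_time']
    # Flag every session (the tested one included) that is active at check_time.
    data['isDup'] = data.apply(lambda x: 1
                               if (x['start_time'] <= x['check_time']) and (x['check_time'] < x['end_time'])
                               else 0
                               , axis = 1)

    return data['isDup'].sum()

limit = 1
df_copy = df.copy()
df['isDup'] = 0

for i, row in df.iterrows():
    # Copy the slice so that getDuplicate() does not write into a view of df_copy.
    data = df_copy.iloc[:limit].copy()
    isDup = getDuplicate(data)
    limit = limit + 1

    # More than one active session means the current login overlaps an earlier one.
    if isDup > 1:
        df.at[i, 'isDup'] = 1
    else:
        df.at[i, 'isDup'] = 0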
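The loop above re-scans every earlier row for each login. One possible vectorized alternative, sketched here under assumptions that are not stated in the question, is to keep a running maximum of the end times seen so far. It assumes the rows are sorted by start_time, that all rows belong to one day, and that an end_time of 00:00:00 means the session lasts until the end of that day. The function name flag_overlapping_logins is hypothetical.

```python
import pandas as pd

def flag_overlapping_logins(df):
    """Return a 0/1 Series: 1 when a login starts while an earlier session is active."""
    # Work in seconds since midnight so the comparisons are plain numeric ones.
    start = pd.to_timedelta(df['start_time'].astype(str)).dt.total_seconds()
    end = pd.to_timedelta(df['end_time'].astype(str)).dt.total_seconds()

    # Assumption: an end_time of 00:00:00 stands for the end of the day.
    end = end.mask(end == 0, 24 * 60 * 60)

    # Latest end time among all earlier rows (NaN for the first row).
    latest_prev_end = end.cummax().shift()

    # Comparisons against NaN are False, so the first row is never flagged.
    return (start < latest_prev_end).astype(int)

# Example 1 data from the question.
df = pd.DataFrame({
    'start_time': ['00:12:38', '00:55:14', '01:00:02', '01:00:02', '01:41:22',
                   '18:58:26', '20:12:37', '20:55:16', '22:21:24', '23:11:30'],
    'end_time':   ['01:00:02', '01:00:02', '01:32:40', '01:08:40', '03:56:23',
                   '19:16:49', '20:52:49', '22:02:50', '22:48:50', '00:00:00'],
})
df['isDup'] = flag_overlapping_logins(df)
print(df)
```

This needs a single pass over the data instead of a nested scan. Like the loop, it lets flagged sessions extend the running maximum, so whether that matches the intended rule is worth checking against real data.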
