Create an access matrix from tuples with timestamps as input

Question

I am working on a university project and have been stuck on a problem for several days.

To start, after some manipulation of the input data, I have this:

import pandas as pd

d = pd.DataFrame({
    'ID':["007",  "001", "009"], 
    'users': [[("us1", "us2", "1577839066196", '1589200898463'), 
               ('us2', "us3", '1589476569647', '1589476734542'), 
               ('us5', 'us1', '1586234607616', '1589195456609'),
               ('us5', 'us1', '1586234607618', '1589195456689')], 
              [("us2", "us3", '1589301928018', '1589463287633'),
               ("us3", "us2", '1589463287633', '1589469006691')], 
              [('us1', 'us2', '1589931863229', '1589931878670')]] })

'users' is a list of tuples of the form (user1, user2, timestamp user1, timestamp user2). These are the users that accessed the ID at those timestamps.
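To make the structure concrete, here is a minimal sketch that flattens the nested tuples into one row per (ID, tuple). The column names `user1`, `user2`, `ts1`, `ts2` and the variable `long` are hypothetical, and the conversion assumes the timestamp strings are Unix epoch milliseconds, which the question does not state explicitly.

```python
import pandas as pd

# Same sample data as in the question.
d = pd.DataFrame({
    'ID': ["007", "001", "009"],
    'users': [[("us1", "us2", "1577839066196", "1589200898463"),
               ("us2", "us3", "1589476569647", "1589476734542"),
               ("us5", "us1", "1586234607616", "1589195456609"),
               ("us5", "us1", "1586234607618", "1589195456689")],
              [("us2", "us3", "1589301928018", "1589463287633"),
               ("us3", "us2", "1589463287633", "1589469006691")],
              [("us1", "us2", "1589931863229", "1589931878670")]]})

# One row per (ID, tuple); the column names are hypothetical.
rows = [(rec_id, u1, u2, int(t1), int(t2))
        for rec_id, tuples in zip(d['ID'], d['users'])
        for (u1, u2, t1, t2) in tuples]
long = pd.DataFrame(rows, columns=['ID', 'user1', 'user2', 'ts1', 'ts2'])

# Assumption: the timestamps are Unix epoch milliseconds.
long['ts1'] = pd.to_datetime(long['ts1'], unit='ms')
long['ts2'] = pd.to_datetime(long['ts2'], unit='ms')

print(long)
```

In this long form each access is a single row, which tends to be easier to filter by date than the nested lists.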

What I want to do is create a matrix of access counts, which I am calling 'access_interest'. The logic would be:

For each (user1, ID, timestamp1) where to_date(timestamp1) < T
  For each user user2:
    If (user1, user2) exists for this ID
      access_interest(user2)[ID] += 1
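One possible reading of the pseudocode above, as a hedged sketch: every tuple whose user1 timestamp falls before the cutoff adds 1 to user2 for that ID, which is how the question's explanation of the expected output reads. The function name `access_interest`, the cutoff value `T`, the assumption that timestamps are epoch milliseconds, and the assumption that IDs are unique are all illustrative choices, not part of the original post.

```python
import pandas as pd

# Same sample data as in the question.
d = pd.DataFrame({
    'ID': ["007", "001", "009"],
    'users': [[("us1", "us2", "1577839066196", "1589200898463"),
               ("us2", "us3", "1589476569647", "1589476734542"),
               ("us5", "us1", "1586234607616", "1589195456609"),
               ("us5", "us1", "1586234607618", "1589195456689")],
              [("us2", "us3", "1589301928018", "1589463287633"),
               ("us3", "us2", "1589463287633", "1589469006691")],
              [("us1", "us2", "1589931863229", "1589931878670")]]})


def access_interest(frame, cutoff):
    """Count, per ID, how often each user is user2 in a tuple whose
    user1 timestamp is earlier than `cutoff` (hypothetical helper)."""
    # Every user that appears in either position becomes a column.
    all_users = sorted({u for tuples in frame['users']
                        for t in tuples for u in t[:2]})
    # Start from an all-zero ID x user matrix (assumes unique IDs).
    matrix = pd.DataFrame(0, index=list(frame['ID']), columns=all_users)
    for rec_id, tuples in zip(frame['ID'], frame['users']):
        for user1, user2, ts1, _ts2 in tuples:
            # Assumption: timestamps are Unix epoch milliseconds.
            if pd.Timestamp(int(ts1), unit='ms') < cutoff:
                matrix.loc[rec_id, user2] += 1
    return matrix


T = pd.Timestamp('2021-01-01')  # hypothetical cutoff; keeps every tuple
matrix = access_interest(d, T)
print(matrix)
```

Under this reading, repeated tuples such as the two (us5, us1) entries for 007 are each counted. If a pair should count only once per ID, the tuples could be deduplicated on (user1, user2) before the loop.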

Edit

The expected output should be:

In the picture you can see that 'us2' has the value 1 for '007'. This is because in the first 'for each', when we fix 'us1' and '007', the pair (us1, us2) exists for 007, so we add 1 to us2.

The same goes for us3: when we fix 'us2' and '007' in the first 'for each', the pair (us2, us3) exists, so we add 1 to us3 for 007.

Answer 1

In [223]: d['users_list'] = d['users'].apply(lambda x: [y[0] for y in x] if isinstance(x, list) else [x[0]])

In [224]: all_users = sorted(list(set(sum([x for x in d['users_list']],[]))))

In [225]: for us in all_users:
     ...:     d[us] = d['users_list'].apply(lambda x :  1 if us in x else 0)
     ...:

In [226]: d
Out[226]:
    ID                                              users            users_list  us1  us2  us3  us5
0  007  [(us1, us2, 1577839066196, 1589200898463), (us...  [us1, us2, us5, us5]    1    1    0    1
1  001  [(us2, us3, 1589301928018, 1589463287633), (us...            [us2, us3]    0    1    1    0
2  009           (us1, us2, 1589931863229, 1589931878670)                 [us1]    1    0    0    0

Output:

In [227]: d.set_index(['ID'])[all_users]
Out[227]:
     us1  us2  us3  us5
ID
007    1    1    0    1
001    0    1    1    0
009    1    0    0    0
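For comparison, a more compact way to arrive at the same kind of 0/1 table is sketched below with `DataFrame.explode` and `pd.crosstab`. This is not the answerer's code; `pairs` and `indicator` are hypothetical names. Like the session above, it only marks whether a user appears as user1 for an ID and ignores the timestamps.

```python
import pandas as pd

# Same sample data as in the question.
d = pd.DataFrame({
    'ID': ["007", "001", "009"],
    'users': [[("us1", "us2", "1577839066196", "1589200898463"),
               ("us2", "us3", "1589476569647", "1589476734542"),
               ("us5", "us1", "1586234607616", "1589195456609"),
               ("us5", "us1", "1586234607618", "1589195456689")],
              [("us2", "us3", "1589301928018", "1589463287633"),
               ("us3", "us2", "1589463287633", "1589469006691")],
              [("us1", "us2", "1589931863229", "1589931878670")]]})

# One row per tuple, then pull out the first element (user1).
pairs = d[['ID', 'users']].explode('users')
pairs['user1'] = pairs['users'].map(lambda t: t[0])

# crosstab counts occurrences; clip turns counts into 0/1 flags,
# and reindex restores the original ID order.
indicator = (pd.crosstab(pairs['ID'], pairs['user1'])
             .clip(upper=1)
             .reindex(list(d['ID'])))
print(indicator)
```

Dropping the `clip` call would keep the raw counts instead of presence flags.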
