
Create unique id based on the relation between two columns

I am working with a big dataset (2M+ rows) that looks like the following:

Id  TeamId  UserId
43  504     722
44  504     727
45  601     300
46  602     722
47  602     727
48  605     300
49  777     300
50  777     301
51  788     400
52  789     400
53  100     727

In this case, TeamId 504 and 602 are the same team, and 601 matches 605 but not 777 (because 777 has one more person in the team).

My goal is to generate a unique ID for each "unique" team:

Id  TeamId  UserId  UniqueId
43  504     722     0
44  504     727     0
45  601     300     1
46  602     722     0
47  602     727     0
48  605     300     1
49  777     300     2
50  777     301     2
51  788     400     3
52  789     400     3
53  100     727     4

A person can be in a team of 1, as in the case of UserId 727: they are part of team 504 (with UserId 722) and of team 100 (alone). These two teams should get 2 different unique IDs.

I cannot group by TeamId alone, as that would treat TeamId 504 and 602 as different teams, nor can I group by UserId, because that would lose track of the teams.

From my understanding, this might be a network problem. I have found a similar question here: Groupby two column values and create a unique id

How can I achieve this? Any help would be appreciated.

For each row, create a new variable (maybe a tuple) that holds the members of that team.

Id  TeamId  UserId  NewVar
43  504     722     (722, 727)
44  504     727     (722, 727)
45  601     300     (300,)
46  602     722     (722, 727)
47  602     727     (722, 727)
48  605     300     (300,)
49  777     300     (300, 301)
50  777     301     (300, 301)
51  788     400     (400,)
52  789     400     (400,)
53  100     727     (727,)

After this step, compare the NewVar values and assign the ID. PS: don't forget to sort the members within NewVar, so that the same team always produces the same tuple.
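The two steps above could be sketched roughly as follows. This is only one possible reading of the answer, not necessarily what its author had in mind: the use of `map` to copy each team's tuple back onto its rows and of `groupby(..., sort=False).ngroup()` to number the distinct tuples are assumptions of this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
    "TeamId": [504, 504, 601, 602, 602, 605, 777, 777, 788, 789, 100],
    "UserId": [722, 727, 300, 722, 727, 300, 300, 301, 400, 400, 727],
})

# Step 1: one sorted tuple of members per TeamId, so member order does not matter.
members = df.groupby("TeamId")["UserId"].apply(lambda s: tuple(sorted(s)))

# Copy each team's tuple back onto every row of that team.
df["NewVar"] = df["TeamId"].map(members)

# Step 2 (assumed approach): number the distinct tuples in order of first appearance.
df["UniqueId"] = df.groupby("NewVar", sort=False).ngroup()

print(df)
```

On the sample data this should reproduce the numbering requested in the question, since teams with identical member tuples fall into the same group.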

You can use pivot_table to get TeamId in the index and UserId in the columns, with each row showing which users are in each team:

import numpy as np

dfp = df.pivot_table(values='Id', index='TeamId', columns='UserId',
                     aggfunc=np.any, fill_value=False)
print(dfp)
UserId    300    301    400    722    727
TeamId                                   
100     False  False  False  False   True
504     False  False  False   True   True
601      True  False  False  False  False
602     False  False  False   True   True
605      True  False  False  False  False
777      True   True  False  False  False
788     False  False   True  False  False
789     False  False   True  False  False

Then, to get the UniqueId, you can sort_values by all columns, take the diff between consecutive rows, use any per row to flag where a new group starts, and cumsum the flags:

print(dfp.sort_values(dfp.columns.tolist()).diff().any(axis=1).cumsum())
TeamId
100    0
504    1 #same number for 504 and 602 but not 100 as you want
602    1
788    2
789    2
601    3
605    3
777    4
dtype: int64

So, to get the new column, you can use map:

df['UniqueId'] = df.TeamId.map(dfp.sort_values(dfp.columns.tolist())
                                  .diff().abs().any(axis=1).cumsum())
print(df)
    Id  TeamId  UserId  UniqueId
0   43     504     722         1
1   44     504     727         1
2   45     601     300         3
3   46     602     722         1
4   47     602     727         1
5   48     605     300         3
6   49     777     300         4
7   50     777     301         4
8   51     788     400         2
9   52     789     400         2
10  53     100     727         0
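The labels above group the teams correctly, but they follow the sorted pivot rows rather than the order used in the question. If the question's numbering matters, one option (an addition here, not part of the original answer) is to renumber the labels by first appearance with `pd.factorize`. The sketch below starts from the UniqueId values exactly as printed above.

```python
import pandas as pd

# UniqueId column as printed above (labels follow the sorted pivot rows).
unique_id = pd.Series([1, 1, 3, 1, 1, 3, 4, 4, 2, 2, 0], name="UniqueId")

# factorize assigns codes 0, 1, 2, ... in order of first appearance.
codes, _ = pd.factorize(unique_id)
renumbered = pd.Series(codes, index=unique_id.index, name="UniqueId")

print(renumbered.tolist())
```

The grouping itself is unchanged; only the label values are remapped.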

Use two groupby calls to get the result:

import pandas as pd

df = pd.DataFrame( {'Id'    :[43,44,45,46,47,48,49,50,51,52,53],
                    'TeamId':[504,504,601,602,602,605,777,777,788,789,100],
                    'UserId':[722,727,300,722,727,300,300,301,400,400,727]})

df_grouped = df.groupby('TeamId')['UserId'].apply(tuple).to_frame().reset_index()

df_grouped = df_grouped.groupby('UserId')['TeamId'].apply(tuple).to_frame().reset_index()

print(df_grouped)

Result:

       UserId      TeamId
0      (300,)  (601, 605)
1  (300, 301)      (777,)
2      (400,)  (788, 789)
3  (722, 727)  (504, 602)
4      (727,)      (100,)

Then just iterate over the TeamId column to set the team number.
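That last iteration step might look something like the sketch below. The dictionary `team_number` and the `sorted()` call on the members are assumptions added for illustration; the original answer does not spell out this part.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
    "TeamId": [504, 504, 601, 602, 602, 605, 777, 777, 788, 789, 100],
    "UserId": [722, 727, 300, 722, 727, 300, 300, 301, 400, 400, 727],
})

# Same two groupby steps as above; sorted() makes the member order irrelevant.
teams = df.groupby("TeamId")["UserId"].apply(lambda s: tuple(sorted(s))).reset_index()
teams = teams.groupby("UserId")["TeamId"].apply(tuple).reset_index()

# Every TeamId listed in the same tuple receives the same team number.
team_number = {}
for number, team_ids in enumerate(teams["TeamId"]):
    for team_id in team_ids:
        team_number[team_id] = number

df["UniqueId"] = df["TeamId"].map(team_number)
print(df)
```

The numbers follow the row order of the grouped frame, so they differ from the numbering in the question, but teams with the same members still share one ID.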
