Create unique id based on the relation between two columns
I am working with a big dataset (2M+ rows) that looks like the following:
Id TeamId UserId
43 504 722
44 504 727
45 601 300
46 602 722
47 602 727
48 605 300
49 777 300
50 777 301
51 788 400
52 789 400
53 100 727
In this case, TeamId 504 and 602 are the same team, and 601 matches 605 but not 777 (because 777 has one extra member).
My goal is to generate unique IDs for each "unique" team:
Id TeamId UserId UniqueId
43 504 722 0
44 504 727 0
45 601 300 1
46 602 722 0
47 602 727 0
48 605 300 1
49 777 300 2
50 777 301 2
51 788 400 3
52 789 400 3
53 100 727 4
A person can be in a team of 1, as in the case of UserId 727: he is part of team 504 (with UserId 722) and of team 100 (alone). This should generate 2 different unique IDs for the two teams.
I cannot group by TeamId only, as it will treat TeamId 504 and 602 as different teams, nor can I group by UserId only, because that does not keep track of the teams.
From my understanding, this might be a network problem. I have found a similar question here: Groupby two column values and create a unique id
How can I achieve this? Any help would be appreciated.
For each row, create a new variable (maybe a tuple) that holds the members of that team:
Id TeamId UserId NewVar
43 504 722 (722, 727)
44 504 727 (722, 727)
45 601 300 (300)
46 602 722 (722, 727)
47 602 727 (722, 727)
48 605 300 (300)
49 777 300 (300, 301)
50 777 301 (300, 301)
51 788 400 (400)
52 789 400 (400)
53 100 727 (727)
After this step, compare the NewVar values and assign the IDs. PS: don't forget to sort the members inside NewVar, so that order differences don't create spurious teams.
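A minimal sketch of this approach, assuming pandas and the column names from the question (the sorted member tuple plus pd.factorize reproduces the question's expected UniqueId values):

```python
import pandas as pd

df = pd.DataFrame({'Id':     [43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
                   'TeamId': [504, 504, 601, 602, 602, 605, 777, 777, 788, 789, 100],
                   'UserId': [722, 727, 300, 722, 727, 300, 300, 301, 400, 400, 727]})

# Build the sorted member tuple for each TeamId (sorting makes the
# comparison independent of the row order within a team)
members = df.groupby('TeamId')['UserId'].apply(lambda s: tuple(sorted(s)))

# Map each row's team to its member tuple, then factorize: identical
# tuples get the same integer label, in order of first appearance
df['NewVar'] = df['TeamId'].map(members)
df['UniqueId'] = pd.factorize(df['NewVar'])[0]
print(df[['Id', 'TeamId', 'UserId', 'UniqueId']])
```

Teams 504 and 602 share the tuple (722, 727) and therefore the same UniqueId, while team 100's singleton tuple (727,) gets its own.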
You can use pivot_table to get TeamId in the index and UserId in the columns, with each row showing which users are in each team, such as:
import numpy as np

dfp = df.pivot_table(values='Id', index='TeamId', columns='UserId',
                     aggfunc=np.any, fill_value=False)
print(dfp)
UserId 300 301 400 722 727
TeamId
100 False False False False True
504 False False False True True
601 True False False False False
602 False False False True True
605 True False False False False
777 True True False False False
788 False False True False False
789 False False True False False
Then, to get the UniqueId, you can sort_values by all columns, take the diff between consecutive rows, check with any per row whether the row belongs to a different group than the previous one, and cumsum the result, such as:
print(dfp.sort_values(dfp.columns.tolist()).diff().any(axis=1).cumsum())
TeamId
100 0
504 1 #same number for 504 and 602 but not 100 as you want
602 1
788 2
789 2
601 3
605 3
777 4
dtype: int64
so to get the new column, you can use map:
df['UniqueId'] = df.TeamId.map(dfp.sort_values(dfp.columns.tolist())
                               .diff().abs().any(axis=1).cumsum())
print(df)
Id TeamId UserId UniqueId
0 43 504 722 1
1 44 504 727 1
2 45 601 300 3
3 46 602 722 1
4 47 602 727 1
5 48 605 300 3
6 49 777 300 4
7 50 777 301 4
8 51 788 400 2
9 52 789 400 2
10 53 100 727 0
Use 2 groupby calls to get the result:
import pandas as pd
df = pd.DataFrame( {'Id' :[43,44,45,46,47,48,49,50,51,52,53],
'TeamId':[504,504,601,602,602,605,777,777,788,789,100],
'UserId':[722,727,300,722,727,300,300,301,400,400,727]})
df_grouped = df.groupby('TeamId')['UserId'].apply(tuple).to_frame().reset_index()
df_grouped = df_grouped.groupby('UserId')['TeamId'].apply(tuple).to_frame().reset_index()
print(df_grouped)
result:
UserId TeamId
0 (300,) (601, 605)
1 (300, 301) (777,)
2 (400,) (788, 789)
3 (722, 727) (504, 602)
4 (727,) (100,)
Then just iterate the TeamId column to set the team number...
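One way to sketch that final step (an assumption on my part, since the answer leaves it open): enumerate the rows of df_grouped, give every TeamId inside a row's tuple the same number, and map it back. Note the labels come out in a different order than the question's example (0..4 follow the sorted UserId tuples), but equal teams still share an ID:

```python
import pandas as pd

df = pd.DataFrame({'Id':     [43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
                   'TeamId': [504, 504, 601, 602, 602, 605, 777, 777, 788, 789, 100],
                   'UserId': [722, 727, 300, 722, 727, 300, 300, 301, 400, 400, 727]})

# The two groupby steps from the answer: team -> member tuple,
# then member tuple -> tuple of teams sharing those members
df_grouped = df.groupby('TeamId')['UserId'].apply(tuple).to_frame().reset_index()
df_grouped = df_grouped.groupby('UserId')['TeamId'].apply(tuple).to_frame().reset_index()

# Each row of df_grouped is one "unique" team; give all TeamIds in
# that row the row's position as their team number
team_number = {team: i
               for i, teams in enumerate(df_grouped['TeamId'])
               for team in teams}
df['UniqueId'] = df['TeamId'].map(team_number)
print(df)
```

As with the first answer, sorting the member tuples before grouping would make the match robust if the rows of a team can appear in a different order.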