What's the most efficient way to find intersections from secondary tables based on a pair of columns in a pandas DataFrame?
I have 3 DataFrames in pandas:
UserItem is a DataFrame of users and the items those users chose, with 2 columns, user and item.
UserTag is a DataFrame of users and tags, with 2 columns, user and tag.
ItemTag is a DataFrame of items and tags, with 2 columns, item and tag.
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})
For each (user, item) pair in UserItem, I want to compute the size of the intersection, and also of the union, of that user's tags with that item's tags.
Answer_df = pd.DataFrame({'user': ['A', 'B', 'B'] , 'item': ['i', 'j', 'k'], 'intersection': [1, 0, 1], 'union' : [1, 2, 2]})
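To pin down exactly what is being asked, here is a minimal brute-force sketch that builds a Python set of tags per user and per item and loops over the pairs. The helper name `brute_force_overlap` is made up for illustration. A row-by-row loop like this is only meant as a reference for checking faster approaches on small data; it is not expected to scale to 30M rows.

```python
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

# Map each user / item to the set of its tags.
user_tags = UserTag_df.groupby('user')['tag'].apply(set).to_dict()
item_tags = ItemTag_df.groupby('item')['tag'].apply(set).to_dict()


def brute_force_overlap(pairs):
    """Hypothetical reference helper: set intersection/union sizes per pair."""
    rows = []
    for user, item in zip(pairs['user'], pairs['item']):
        u = user_tags.get(user, set())  # users without tags get an empty set
        i = item_tags.get(item, set())  # items without tags get an empty set
        rows.append((user, item, len(u & i), len(u | i)))
    return pd.DataFrame(rows, columns=['user', 'item', 'intersection', 'union'])


reference_df = brute_force_overlap(UserItem_df)
print(reference_df)
```

On the sample data this gives the same intersection and union sizes as Answer_df.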
What's the most efficient way to do this? UserItem_df has 30M rows, and the other two DataFrames have roughly 500k rows each. The Cartesian product of all possible (user, item) pairs is about 30 billion, but I don't need the intersections and unions for every possible pair, just for the pairs in the UserItem DataFrame.
Use:
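# step 1: attach each user's tags to every (user, item) pair
df1 = pd.merge(UserItem_df, UserTag_df, on='user')
# step 2: attach each item's tags to every (user, item) pair
df2 = pd.merge(UserItem_df, ItemTag_df, on='item')
# step 3: stack both tag lists; a tag shared by user and item now appears twice for that pair
df3 = pd.concat([df1, df2], ignore_index=True)
# step 4: 'count' is |user tags| + |item tags|, 'nunique' is the size of the union
df3 = (
    df3.groupby(['user', 'item'])['tag']
    .agg(intersection='count', union='nunique')
    .reset_index()
)
# intersection = (|user tags| + |item tags|) - |union|
# (assumes UserTag_df and ItemTag_df contain no duplicate rows)
df3['intersection'] -= df3['union']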
Steps:
# step 1: df1
  user item tag
0    A    i   T
1    B    j   R
2    B    k   R
# step 2: df2
  user item tag
0    A    i   T
1    B    j   S
2    B    k   T
3    B    k   R
# step 3: df3
  user item tag
0    A    i   T
1    B    j   R
2    B    k   R
3    A    i   T
4    B    j   S
5    B    k   T
6    B    k   R
# step 4: df3
  user item  intersection  union
0    A    i             1      1
1    B    j             0      2
2    B    k             1      2
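The subtraction in step 4 relies on the identity |U ∩ I| = |U| + |I| - |U ∪ I|. The same identity can be applied the other way round, which may be worth trying at the 30M-row scale: count the shared tags with a two-key merge, then derive the union from per-user and per-item tag counts. The sketch below is one possible variant, not the answer's method; the function name `tag_overlap` is hypothetical, and whether it is actually faster or lighter on memory for the real data is an assumption that would need measuring.

```python
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})


def tag_overlap(user_item, user_tag, item_tag):
    """Hypothetical helper: intersection/union sizes via inclusion-exclusion."""
    # Repeated (user, tag) or (item, tag) rows would inflate the counts.
    user_tag = user_tag.drop_duplicates()
    item_tag = item_tag.drop_duplicates()
    pairs = user_item[['user', 'item']].drop_duplicates()

    # |tags(user)| and |tags(item)|, computed once on the two small tables.
    n_user = user_tag.groupby('user').size()
    n_item = item_tag.groupby('item').size()

    # A tag is shared by a pair exactly when both (user, tag) and (item, tag) exist.
    shared = (
        pairs.merge(user_tag, on='user')
        .merge(item_tag, on=['item', 'tag'])
        .groupby(['user', 'item'])
        .size()
        .rename('intersection')
        .reset_index()
    )

    # Left merge keeps every requested pair, including those with no shared tag.
    out = user_item.merge(shared, on=['user', 'item'], how='left')
    out['intersection'] = out['intersection'].fillna(0).astype('int64')

    # Users or items absent from the tag tables count as having zero tags.
    sizes = out['user'].map(n_user).fillna(0) + out['item'].map(n_item).fillna(0)
    out['union'] = sizes.astype('int64') - out['intersection']
    return out


result = tag_overlap(UserItem_df, UserTag_df, ItemTag_df)
print(result)
```

One difference in behavior: because the final step is a left merge onto UserItem, a pair whose user and item both have no tags stays in the output with 0 and 0, whereas the concat-based version drops such a pair, since it appears in neither df1 nor df2.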