What is the most efficient way to find intersections from auxiliary tables based on a pair of columns in a pandas DataFrame?

Question

I have 3 DataFrames in pandas:

UserItem is a DataFrame of users and the items those users chose, with 2 columns, user and item.

UserTag is a DataFrame of users and tags, with 2 columns, user and tag.

ItemTag is a DataFrame of items and tags, with 2 columns, item and tag.

import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B']      ,  'item': ['i', 'j', 'k']})
UserTag_df  = pd.DataFrame({'user': ['A', 'B']           ,  'tag' : ['T', 'R']})
ItemTag_df  = pd.DataFrame({'item': ['i', 'j', 'k', 'k'] ,  'tag' : ['T', 'S', 'T', 'R']})

I want to compute, for each (user, item) pair in UserItem, the size of the intersection (and of the union) of that user's tags with that item's tags. For the example above, the expected result is:

Answer_df = pd.DataFrame({'user': ['A', 'B', 'B']  , 'item': ['i', 'j', 'k'], 'intersection':  [1, 0, 1], 'union' : [1, 2, 2]})

What's the most efficient way to do this? UserItem_df has about 30M rows, and the other two DataFrames have about 500k rows each. The set of all possible (user, item) pairs is about 30 billion, but I don't need the intersection and union for every possible pair, only for the pairs that appear in UserItem_df.
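To pin down exactly what is being asked, here is a minimal brute-force reference that computes the two sizes with Python sets. It is only a sketch for checking faster approaches on small inputs, since a Python-level loop over 30M pairs would be slow. The names `user_tags`, `item_tags`, `reference_counts` and `reference_df` are hypothetical and do not appear in the question.

```python
import pandas as pd

UserItem_df = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
UserTag_df = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
ItemTag_df = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

# Map each user and each item to the set of its tags.
user_tags = UserTag_df.groupby('user')['tag'].apply(set).to_dict()
item_tags = ItemTag_df.groupby('item')['tag'].apply(set).to_dict()


def reference_counts(pairs):
    """Return intersection and union sizes for every (user, item) row in `pairs`."""
    rows = []
    for user, item in zip(pairs['user'], pairs['item']):
        u = user_tags.get(user, set())  # users without tags get an empty set
        i = item_tags.get(item, set())  # items without tags get an empty set
        rows.append((user, item, len(u & i), len(u | i)))
    return pd.DataFrame(rows, columns=['user', 'item', 'intersection', 'union'])


reference_df = reference_counts(UserItem_df)
print(reference_df)
```

On the example data this reproduces the intersection and union columns of Answer_df.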

Answer 1

Use:

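# step 1: attach each user's tags to every (user, item) pair
df1 = pd.merge(UserItem_df, UserTag_df, on='user')

# step 2: attach each item's tags to every (user, item) pair
df2 = pd.merge(UserItem_df, ItemTag_df, on='item')

# step 3: stack both tag lists into one long frame
df3 = pd.concat([df1, df2], ignore_index=True)

# step 4: per pair, count all tag rows and the distinct tags
df3 = (
    df3.groupby(['user', 'item'])['tag']
    .agg(intersection='count', union='nunique')
    .reset_index()
)
# total rows minus distinct tags = tags that appear on both sides
df3['intersection'] -= df3['union']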
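The subtraction in step 4 works by inclusion-exclusion. If U is the user's tag set and I is the item's tag set, then count gives |U| + |I|, nunique gives |U ∪ I|, and their difference is |U ∩ I|. That identity only holds when the inputs contain no repeated rows, which the question does not state. The sketch below wraps the same steps in a function and drops duplicates first. The function name `tag_overlap` and the small test frames are assumptions made for illustration.

```python
import pandas as pd


def tag_overlap(user_item, user_tag, item_tag):
    """Same merge/concat/groupby steps, guarded against duplicate rows."""
    # Repeated rows would inflate 'count' and therefore the intersection.
    pairs = user_item[['user', 'item']].drop_duplicates()
    user_tag = user_tag.drop_duplicates()
    item_tag = item_tag.drop_duplicates()

    by_user = pairs.merge(user_tag, on='user')   # tags coming from the user
    by_item = pairs.merge(item_tag, on='item')   # tags coming from the item
    stacked = pd.concat([by_user, by_item], ignore_index=True)

    out = (
        stacked.groupby(['user', 'item'])['tag']
        .agg(intersection='count', union='nunique')
        .reset_index()
    )
    # |U| + |I| - |U ∪ I| = |U ∩ I|
    out['intersection'] -= out['union']
    return out


user_item = pd.DataFrame({'user': ['A', 'B', 'B'], 'item': ['i', 'j', 'k']})
# ('A', 'T') is listed twice on purpose to exercise the de-duplication.
user_tag = pd.DataFrame({'user': ['A', 'A', 'B'], 'tag': ['T', 'T', 'R']})
item_tag = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

result = tag_overlap(user_item, user_tag, item_tag)
print(result)
```

Without the `drop_duplicates` calls, the repeated ('A', 'T') row would make the intersection for (A, i) come out as 2 instead of 1.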

Intermediate results for each step:

# step 1: df1
  user item tag
0    A    i   T
1    B    j   R
2    B    k   R

# step 2: df2
  user item tag
0    A    i   T
1    B    j   S
2    B    k   T
3    B    k   R

# step 3: df3
  user item tag
0    A    i   T
1    B    j   R
2    B    k   R
3    A    i   T
4    B    j   S
5    B    k   T
6    B    k   R

# step 4: df3
  user item  intersection  union
0    A    i             1      1
1    B    j             0      2
2    B    k             1      2
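One thing the walkthrough above does not show: because both merges are inner joins, a (user, item) pair whose user and item both have no tags never reaches the groupby and is missing from the result. The following is an alternative sketch, not taken from the original answer and not benchmarked at the 30M-row scale. It counts shared tags with a join on (item, tag), derives the union as |U| + |I| - |U ∩ I|, and uses left joins so that every pair is kept. The function name `overlap_by_join` and the helper columns `n_user` and `n_item` are hypothetical.

```python
import pandas as pd


def overlap_by_join(user_item, user_tag, item_tag):
    """Intersection via a join on (item, tag); union via inclusion-exclusion."""
    pairs = user_item[['user', 'item']].drop_duplicates()
    user_tag = user_tag.drop_duplicates()
    item_tag = item_tag.drop_duplicates()

    # Rows survive both inner joins only if the tag belongs to the user AND the item.
    shared = pairs.merge(user_tag, on='user').merge(item_tag, on=['item', 'tag'])
    inter = shared.groupby(['user', 'item']).size().reset_index(name='intersection')

    # Number of distinct tags per user and per item.
    n_user = user_tag.groupby('user').size().reset_index(name='n_user')
    n_item = item_tag.groupby('item').size().reset_index(name='n_item')

    # Left joins keep pairs that have no shared tags or no tags at all.
    out = (
        pairs.merge(inter, on=['user', 'item'], how='left')
        .merge(n_user, on='user', how='left')
        .merge(n_item, on='item', how='left')
        .fillna({'intersection': 0, 'n_user': 0, 'n_item': 0})
    )
    out['intersection'] = out['intersection'].astype(int)
    out['union'] = (out['n_user'] + out['n_item'] - out['intersection']).astype(int)
    return out[['user', 'item', 'intersection', 'union']]


# The pair ('C', 'z') has no tags on either side and should still appear.
user_item = pd.DataFrame({'user': ['A', 'B', 'B', 'C'], 'item': ['i', 'j', 'k', 'z']})
user_tag = pd.DataFrame({'user': ['A', 'B'], 'tag': ['T', 'R']})
item_tag = pd.DataFrame({'item': ['i', 'j', 'k', 'k'], 'tag': ['T', 'S', 'T', 'R']})

result = overlap_by_join(user_item, user_tag, item_tag)
print(result)
```

Whether this uses less memory than the concat approach depends on how many tags each user and item carries, so it is worth profiling on a sample before committing to either version.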

Solution 1
1 Accepted 2020-06-16 14:36:10
