Join two dataframes where a column value (a set) is a subset of the other dataframe's

Question

I have two data frames:

df1 = pd.DataFrame([[set(['foo', 'baz'])],
                    [set(['bar', 'baz'])]], columns=['items'])



    items
0   {foo, baz}
1   {bar, baz}

df2 = pd.DataFrame([[set(['bar', 'baz', 'foo']), 1],
                    [set(['bar', 'baz', 'foo']), 2],
                    [set(['bar', 'baz', 'foo']), 3],
                    [set(['one', 'two', 'bar']), 2]], columns=['items', 'other'])



    items           other
0   {foo, bar, baz} 1
1   {foo, bar, baz} 2
2   {foo, bar, baz} 3
3   {two, one, bar} 2

The goal is to join df2 with df1 wherever the value in df1.items is a subset of the value in df2.items. Both columns hold set() values.

For context, this is to join association rules with customer purchases after running the apriori algorithm.

Adding expected output:

df3 = pd.DataFrame([[[set(['foo', 'baz'])], set(['bar', 'baz', 'foo']), 1],
                    [[set(['foo', 'baz'])], set(['bar', 'baz', 'foo']), 2],
                    [[set(['foo', 'baz'])], set(['bar', 'baz', 'foo']), 3],
                    [[set(['bar', 'baz'])], None, None]], columns=['items', 'items', 'other'])


    items           items           other
0   [{foo, baz}]    {foo, bar, baz} 1.0
1   [{foo, baz}]    {foo, bar, baz} 2.0
2   [{foo, baz}]    {foo, bar, baz} 3.0
3   [{bar, baz}]    None    NaN

Answer 1

Create your dataframes, adding a constant key column to both so they can be cross-joined:

import pandas as pd

df1 = pd.DataFrame({'key': [1, 1],
                    'id': [0, 1],
                    'items': [set(['foo', 'baz']), set(['bar', 'baz'])]})

df2 = pd.DataFrame({'key': [1, 1, 1, 1],
                    'items': [set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']), set(['one', 'two', 'bar'])],
                    'other': [1, 2, 3, 2]
                   })

Then make a Cartesian product by merging on the constant key:

merged_df = df1.merge(df2, on='key')
merged_df

   key  id     items_x          items_y  other
0    1   0  {baz, foo}  {foo, baz, bar}      1
1    1   0  {baz, foo}  {foo, baz, bar}      2
2    1   0  {baz, foo}  {foo, baz, bar}      3
3    1   0  {baz, foo}  {one, bar, two}      2
4    1   1  {baz, bar}  {foo, baz, bar}      1
5    1   1  {baz, bar}  {foo, baz, bar}      2
6    1   1  {baz, bar}  {foo, baz, bar}      3
7    1   1  {baz, bar}  {one, bar, two}      2

Define your custom function and check that it works on a single row:

def check_if_all_in_list(list1, list2):
    return all(elem in list2 for elem in list1)

check_if_all_in_list(merged_df['items_x'][0], merged_df['items_y'][0])
True
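Since both columns already hold sets, the same check can probably be written with Python's built-in subset operator. A minimal sketch, where is_subset is a hypothetical helper name that does not appear in the original answer:

```python
def is_subset(rule_items, basket_items):
    # For sets, a <= b is True when every element of a is also in b
    # (equivalent to a.issubset(b)).
    return set(rule_items) <= set(basket_items)


print(is_subset({'foo', 'baz'}, {'bar', 'baz', 'foo'}))
print(is_subset({'foo', 'baz'}, {'one', 'two', 'bar'}))
```

The first call prints True and the second prints False, matching what check_if_all_in_list returns for the same inputs.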

Compute the match for every row of the product:

merged_df['check'] = merged_df.apply(lambda row: check_if_all_in_list(row['items_x'], row['items_y']), axis=1)
merged_df

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
3    1   0  {baz, foo}  {one, bar, two}      2  False
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True
7    1   1  {baz, bar}  {one, bar, two}      2  False

Now filter out the rows you don't want:

mask = merged_df['check']
merged_df[mask]

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True
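The filtered result above drops any df1 row that matches nothing, while the question asks for such rows to be kept with missing values. One possible way to get that is sketched below, assuming pandas 1.2 or later for how='cross'. The names rules, baskets, rule_id, rule_items and basket_items are illustrative, and the third rule {'qux'} is a made-up row added only so that one rule has no match.

```python
import pandas as pd

# Illustrative data: the first two rules come from the question,
# {'qux'} is a hypothetical rule that matches no basket.
rules = pd.DataFrame({'rule_items': [{'foo', 'baz'}, {'bar', 'baz'}, {'qux'}]})
baskets = pd.DataFrame({
    'basket_items': [{'bar', 'baz', 'foo'}, {'bar', 'baz', 'foo'},
                     {'bar', 'baz', 'foo'}, {'one', 'two', 'bar'}],
    'other': [1, 2, 3, 2],
})

# Give each rule an id so unmatched rules can be restored afterwards.
rules = rules.reset_index().rename(columns={'index': 'rule_id'})

# how='cross' builds the Cartesian product without a dummy key column.
pairs = rules.merge(baskets, how='cross')

# Keep only the pairs where the rule is a subset of the basket.
is_match = [r <= b for r, b in zip(pairs['rule_items'], pairs['basket_items'])]
matched = pairs.loc[is_match]

# Left-join back onto the rules: rules without a match keep one row of NaN.
result = rules.merge(matched.drop(columns='rule_items'), on='rule_id', how='left')
print(result)
```

The Cartesian product grows as len(rules) * len(baskets), so this approach is only reasonable when both frames are small enough for that product to fit in memory.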

Answer 2

If you simply want to filter df2 by the condition (somewhat like select ... from table where X in (select ...)), you can do:

df2.loc[df2["items"].apply(lambda x: any(el.intersection(x)==el for el in df1["items"].tolist()))]

Output:

   items                other
0  {foo, baz, bar}      1
1  {foo, baz, bar}      2
2  {foo, baz, bar}      3

To achieve a "left join"-like effect, where non-matching rows of df2 are kept but their other value is blanked out:

import numpy as np

df2["match"]=df2["items"].apply(lambda x: any(el.intersection(x)==el for el in df1["items"].tolist()))

df2.loc[~df2["match"], ["other"]]=np.nan

df2.drop(columns="match", inplace=True)

Output:

   items              other
0  {bar, baz, foo}    1.0
1  {bar, baz, foo}    2.0
2  {bar, baz, foo}    3.0
3  {two, bar, one}    NaN
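The snippet above modifies df2 in place. If the original frame should stay untouched, a variant along the following lines may work. It is a sketch rather than the answerer's method: rule_sets, match and result are names introduced here, and Series.where is used in place of the .loc assignment.

```python
import pandas as pd

df1 = pd.DataFrame({'items': [{'foo', 'baz'}, {'bar', 'baz'}]})
df2 = pd.DataFrame({
    'items': [{'bar', 'baz', 'foo'}, {'bar', 'baz', 'foo'},
              {'bar', 'baz', 'foo'}, {'one', 'two', 'bar'}],
    'other': [1, 2, 3, 2],
})

rule_sets = df1['items'].tolist()

# True for each df2 row whose set contains at least one df1 set entirely.
match = df2['items'].apply(lambda basket: any(rule <= basket for rule in rule_sets))

# Series.where keeps values where the mask is True and puts NaN elsewhere;
# assign returns a new frame, so df2 itself is left unchanged.
result = df2.assign(other=df2['other'].where(match))
print(result)
```

Because a new frame is returned, the same df2 can be reused against a different set of rules without having to rebuild it.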

Answer 1
1 accepted 2019-12-12 19:07:57

Answer 2
0 2019-12-12 19:36:31
