
Join two DataFrames where the column values (a set) are a subset of the other's

I have two data frames:

df1 = pd.DataFrame([[set(['foo', 'baz'])],
                    [set(['bar', 'baz'])]], columns=['items'])



    items
0   {foo, baz}
1   {bar, baz}

df2 = pd.DataFrame([[set(['bar', 'baz', 'foo']), 1],
                    [set(['bar', 'baz', 'foo']), 2],
                    [set(['bar', 'baz', 'foo']), 3],
                    [set(['one', 'two', 'bar']), 2]], columns=['items', 'other'])



    items           other
0   {foo, bar, baz} 1
1   {foo, bar, baz} 2
2   {foo, bar, baz} 3
3   {two, one, bar} 2

The goal is to join df2 with df1 where the values in df1.items are a subset of df2.items. Both columns hold set() values.

For context, this is to join association rules with customer purchases after implementing the apriori algorithm.

Adding expected output:

df3 = pd.DataFrame([[[set(['foo', 'baz'])], set(['bar', 'baz', 'foo']), 1],
                    [[set(['foo', 'baz'])], set(['bar', 'baz', 'foo']), 2],
                    [[set(['foo', 'baz'])], set(['bar', 'baz', 'foo']), 3],
                    [[set(['bar', 'baz'])], None, None]], columns=['items', 'items', 'other'])


    items           items           other
0   [{foo, baz}]    {foo, bar, baz} 1.0
1   [{foo, baz}]    {foo, bar, baz} 2.0
2   [{foo, baz}]    {foo, bar, baz} 3.0
3   [{bar, baz}]    None    NaN

Create your dataframes, adding a constant key column to merge on:

import pandas as pd

df1 = pd.DataFrame({'key': [1, 1],
                    'id': [0, 1],
                    'items': [set(['foo', 'baz']), set(['bar', 'baz'])]})

df2 = pd.DataFrame({'key': [1, 1, 1, 1],
                    'items': [set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']), set(['one', 'two', 'bar'])],
                    'other': [1, 2, 3, 2]
                   })

Then make a Cartesian product by merging on that key:

merged_df = df1.merge(df2, on='key')
merged_df

   key  id     items_x          items_y  other
0    1   0  {baz, foo}  {foo, baz, bar}      1
1    1   0  {baz, foo}  {foo, baz, bar}      2
2    1   0  {baz, foo}  {foo, baz, bar}      3
3    1   0  {baz, foo}  {one, bar, two}      2
4    1   1  {baz, bar}  {foo, baz, bar}      1
5    1   1  {baz, bar}  {foo, baz, bar}      2
6    1   1  {baz, bar}  {foo, baz, bar}      3
7    1   1  {baz, bar}  {one, bar, two}      2
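
As a possible alternative to the dummy key column, pandas 1.2 and later accept how='cross' in merge, which builds the same Cartesian product directly. The following is a minimal sketch on a shortened version of the data above; the name cross_df is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [0, 1],
                    'items': [{'foo', 'baz'}, {'bar', 'baz'}]})
df2 = pd.DataFrame({'items': [{'bar', 'baz', 'foo'}, {'one', 'two', 'bar'}],
                    'other': [1, 2]})

# how='cross' pairs every row of df1 with every row of df2,
# so no constant 'key' column is needed (requires pandas >= 1.2).
cross_df = df1.merge(df2, how='cross')

# The overlapping 'items' columns get the default _x / _y suffixes.
print(cross_df.columns.tolist())
print(len(cross_df))
```

Either way, the intermediate frame has len(df1) * len(df2) rows, so this approach may become expensive for large inputs.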

Define your custom function and check that it works in one case:

def check_if_all_in_list(list1, list2):
    return all(elem in list2 for elem in list1)

check_if_all_in_list(merged_df['items_x'][0], merged_df['items_y'][0])
True
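
Since both columns already hold Python sets, the built-in subset test should give the same answer as the custom function. A small sketch comparing the two; the variable names rule, basket and other_basket are made up for illustration.

```python
def check_if_all_in_list(list1, list2):
    # Same helper as above: True when every element of list1 is in list2.
    return all(elem in list2 for elem in list1)

rule = {'foo', 'baz'}
basket = {'bar', 'baz', 'foo'}
other_basket = {'one', 'two', 'bar'}

# set.issubset and the <= operator both test "is a subset of".
print(rule.issubset(basket))
print(rule <= other_basket)
print(check_if_all_in_list(rule, basket) == rule.issubset(basket))
```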

Create your match column by applying the function to every row:

merged_df['check'] = merged_df.apply(lambda row: check_if_all_in_list(row['items_x'], row['items_y']), axis=1)
merged_df

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
3    1   0  {baz, foo}  {one, bar, two}      2  False
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True
7    1   1  {baz, bar}  {one, bar, two}      2  False

Now filter out what you don't want:

mask = merged_df['check']  # the column is already boolean, so it can be used as the mask directly
merged_df[mask]

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True
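
The filtered frame above behaves like an inner join: a df1 row with no matching df2 row would disappear. If such rows should be kept with missing values, as in the question's expected output, one option is to merge the matches back onto df1 with how='left'. This is only a sketch under assumptions: it relies on the id column introduced in this answer, uses how='cross' (pandas >= 1.2) instead of the key column, and adds a made-up {'qux'} row to df1 so that there is something unmatched to show.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [0, 1, 2],
                    # {'qux'} is a hypothetical row that matches nothing in df2
                    'items': [{'foo', 'baz'}, {'bar', 'baz'}, {'qux'}]})
df2 = pd.DataFrame({'items': [{'bar', 'baz', 'foo'}, {'one', 'two', 'bar'}],
                    'other': [1, 2]})

# Pair every df1 row with every df2 row, then keep the subset matches.
pairs = df1.merge(df2, how='cross')
is_subset = [x <= y for x, y in zip(pairs['items_x'], pairs['items_y'])]
matched = pairs.loc[is_subset]

# Left-join the matches back on 'id' so unmatched df1 rows survive with NaN.
result = df1.merge(matched.drop(columns='items_x'), on='id', how='left')
print(result)
```

Because of the missing value in the unmatched row, the other column comes back as float rather than int.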

If you simply want to filter df2 by the condition (much like select ... from table where X in (select ...)), you can do:

df2.loc[df2["items"].apply(lambda x: any(el.intersection(x)==el for el in df1["items"].tolist()))]

Output:

   items                other
0  {foo, baz, bar}      1
1  {foo, baz, bar}      2
2  {foo, baz, bar}      3

To achieve a "left join"-like effect:

import numpy as np

df2["match"]=df2["items"].apply(lambda x: any(el.intersection(x)==el for el in df1["items"].tolist()))

df2.loc[~df2["match"], ["other"]]=np.nan

df2.drop(columns="match", inplace=True)

Output:

   items              other
0  {bar, baz, foo}    1.0
1  {bar, baz, foo}    2.0
2  {bar, baz, foo}    3.0
3  {two, bar, one}    NaN
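
The snippet above modifies df2 in place. If the original frame should stay untouched, a possible variant builds the mask once and uses Series.where together with DataFrame.assign to return a new frame. This is a sketch rather than the answer's own method; the names rules, match and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'items': [{'foo', 'baz'}, {'bar', 'baz'}]})
df2 = pd.DataFrame({'items': [{'bar', 'baz', 'foo'}, {'bar', 'baz', 'foo'},
                              {'bar', 'baz', 'foo'}, {'one', 'two', 'bar'}],
                    'other': [1, 2, 3, 2]})

rules = df1['items'].tolist()

# True where at least one set from df1 is a subset of the df2 row's set.
match = df2['items'].apply(lambda basket: any(rule <= basket for rule in rules))

# Series.where keeps values where the mask is True and inserts NaN elsewhere;
# assign returns a new DataFrame, leaving df2 itself unchanged.
result = df2.assign(other=df2['other'].where(match))
print(result)
```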
