简体   繁体   中英

How to find common elements in several dataframes

I have the following dataframes:

df1 = pd.DataFrame({'col1': ['A','M','C'],
                   'col2': ['B','N','O'],
                   # plus many more
                   })

df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
                    'col4': ['M','P','Q','J','P','M'],
                   # plus many more
                   })

Which look like these:

df1:

col1   col2
A      B
M      N
C      O
#...plus many more

df2:

col3   col4
A      M
A      P
A      Q
B      J
B      P
B      M
#...plus many more

The objective is to create a dataframe containing all elements in col4 for each col3 that occurs in one row in df1 . For example, let's look at row 1 of df1 . We see that A is in col1 and B is in col2 . Then, we go to df2 , and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B'] . We get, for A : ['M','P','Q'] , and for B , ['J','P','M'] . The intersection of these is ['M', 'P'] , so what I want is something like this

col1   col2    col4
A      B       M
A      B       P
....(and so on for the other rows)

The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.

This should achieve what you want, using a combination of merge , groupby and set intersection:

# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()

# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()

# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)

# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)

print(df3)

  col1 col2    col4
0    A    B  {P, M}

If required, see this answer for examples of how to 'melt' col4 .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM