Filter duplicated rows based on selected columns and comparing with another dataframe in Pandas

Given two dataframes as follows:

import pandas as pd

# Create the two DataFrame objects
df1 = pd.DataFrame([('Stuti', 28, 'Varanasi'),
                    ('Saumya', 32, 'Delhi'),
                    ('Aaditya', 25, 'Mumbai'),
                    ('Saumya', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])

df2 = pd.DataFrame([('Saumya', 32, 'Delhi'),
                    ('Saumya', 32, 'Mumbai'),
                    ('Aaditya', 40, 'Mumbai'),
                    ('Seema', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])

How can I create a mask for df2 that flags duplicated rows based on df1, comparing only the columns Name and City? If the same pair exists in df1, the new Check column should contain Duplicated; otherwise it should contain None.

The expected result looks like this:

    Name  Score      City       Check
0   Saumya     32     Delhi  Duplicated
1   Saumya     32    Mumbai        None
2  Aaditya     40    Mumbai  Duplicated
3    Seema     32     Delhi        None

Updated code:

df = pd.concat([df1, df2])

df[df.duplicated(['Name', 'City'])] 

Out:

      Name  Score    City
3   Saumya     32   Delhi
0   Saumya     32   Delhi
2  Aaditya     40  Mumbai
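
A left merge against the unique Name/City pairs of df1, with the indicator column, gives the flag directly (np is numpy, imported with import numpy as np):

In [65]: df2.merge(df1[['Name', 'City']].drop_duplicates(), how='left', indicator='Check').assign(Check=lambda x: np.where(x['Check'] == 'both', 'Duplicated', None))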
Out[65]:
      Name  Score    City       Check
0   Saumya     32   Delhi  Duplicated
1   Saumya     32  Mumbai        None
2  Aaditya     40  Mumbai  Duplicated
3    Seema     32   Delhi        None
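
To see why the merge approach works, it can help to look at the indicator column before it is mapped to labels. The following is a minimal sketch that rebuilds the two frames from the question; the names pairs, merged and indicator_values are chosen here for illustration, and the on= argument is spelled out explicitly although the one-liner above relies on the shared column names. With how='left', rows of df2 that found a match in df1 should be marked 'both' and the others 'left_only'.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [('Stuti', 28, 'Varanasi'), ('Saumya', 32, 'Delhi'),
     ('Aaditya', 25, 'Mumbai'), ('Saumya', 32, 'Delhi')],
    columns=['Name', 'Score', 'City'])
df2 = pd.DataFrame(
    [('Saumya', 32, 'Delhi'), ('Saumya', 32, 'Mumbai'),
     ('Aaditya', 40, 'Mumbai'), ('Seema', 32, 'Delhi')],
    columns=['Name', 'Score', 'City'])

# Unique (Name, City) pairs from df1. Dropping duplicates keeps the
# left merge from emitting extra rows for df2.
pairs = df1[['Name', 'City']].drop_duplicates()

# indicator='Check' adds a categorical column named 'Check' that records
# where each row came from.
merged = df2.merge(pairs, on=['Name', 'City'], how='left', indicator='Check')
indicator_values = merged['Check'].astype(str).tolist()

# Map the indicator to the labels requested in the question.
merged['Check'] = np.where(merged['Check'] == 'both', 'Duplicated', None)

print(indicator_values)
print(merged)
```

Without the drop_duplicates() call, the repeated ('Saumya', 'Delhi') pair in df1 would match the first row of df2 twice, so the result would have more rows than df2.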

You can convert both columns to a MultiIndex in each DataFrame and compare them pair by pair with isin:

import numpy as np

# True where the (Name, City) pair of a df2 row also occurs in df1
m = df2.set_index(['Name','City']).index.isin(df1.set_index(['Name','City']).index)
df2['Check'] = np.where(m, 'Duplicated', None)
print(df2)
      Name  Score    City       Check
0   Saumya     32   Delhi  Duplicated
1   Saumya     32  Mumbai        None
2  Aaditya     40  Mumbai  Duplicated
3    Seema     32   Delhi        None
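
If the same check is needed in several places, the MultiIndex comparison could be wrapped in a small helper that leaves df2 unmodified. This is only a sketch: flag_duplicates and its parameters are hypothetical names, not part of pandas, and it builds the indexes with pd.MultiIndex.from_frame rather than set_index, which should give the same pairs here.

```python
import numpy as np
import pandas as pd


def flag_duplicates(target, reference, keys, label='Duplicated'):
    """Return a copy of `target` with a 'Check' column that holds `label`
    where the row's `keys` values also occur in `reference`, else None."""
    target_idx = pd.MultiIndex.from_frame(target[keys])
    reference_idx = pd.MultiIndex.from_frame(reference[keys])
    mask = target_idx.isin(reference_idx)  # boolean array, one entry per row
    return target.assign(Check=np.where(mask, label, None))


df1 = pd.DataFrame(
    [('Stuti', 28, 'Varanasi'), ('Saumya', 32, 'Delhi'),
     ('Aaditya', 25, 'Mumbai'), ('Saumya', 32, 'Delhi')],
    columns=['Name', 'Score', 'City'])
df2 = pd.DataFrame(
    [('Saumya', 32, 'Delhi'), ('Saumya', 32, 'Mumbai'),
     ('Aaditya', 40, 'Mumbai'), ('Seema', 32, 'Delhi')],
    columns=['Name', 'Score', 'City'])

result = flag_duplicates(df2, df1, ['Name', 'City'])
print(result)
```

Because DataFrame.assign returns a new object, df2 itself does not gain a Check column in this version.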
