Given two dataframes as follows:
import pandas as pd

# Creating the DataFrame objects
df1 = pd.DataFrame([('Stuti', 28, 'Varanasi'),
                    ('Saumya', 32, 'Delhi'),
                    ('Aaditya', 25, 'Mumbai'),
                    ('Saumya', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])
df2 = pd.DataFrame([('Saumya', 32, 'Delhi'),
                    ('Saumya', 32, 'Mumbai'),
                    ('Aaditya', 40, 'Mumbai'),
                    ('Seema', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])
How could I create a mask for df2 to filter duplicated rows based on df1 and the columns Name and City? If the same pair exists in df1, the Check column should return Duplicated; otherwise it should return None.
The expected result will look like:
      Name  Score    City       Check
0   Saumya     32   Delhi  Duplicated
1   Saumya     32  Mumbai        None
2  Aaditya     40  Mumbai  Duplicated
3    Seema     32   Delhi        None
Updated code:
df = pd.concat([df1, df2])
df[df.duplicated(['Name', 'City'])]
Out:
      Name  Score    City
3   Saumya     32   Delhi
0   Saumya     32   Delhi
2  Aaditya     40  Mumbai
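To see why this attempt does not yet give the wanted mask, the sketch below repeats it with the `keys=` argument of `pd.concat`, so that each flagged row shows which frame it came from. The names `combined`, `flagged` and `from_df2` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([('Stuti', 28, 'Varanasi'),
                    ('Saumya', 32, 'Delhi'),
                    ('Aaditya', 25, 'Mumbai'),
                    ('Saumya', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])
df2 = pd.DataFrame([('Saumya', 32, 'Delhi'),
                    ('Saumya', 32, 'Mumbai'),
                    ('Aaditya', 40, 'Mumbai'),
                    ('Seema', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])

# keys= adds an outer index level naming the frame each row came from
combined = pd.concat([df1, df2], keys=['df1', 'df2'])

# Same check as the attempt above: repeated (Name, City) pairs after the first
flagged = combined[combined.duplicated(['Name', 'City'])]
print(flagged)

# Keep only the flagged rows that originate from df2
from_df2 = flagged[flagged.index.get_level_values(0) == 'df2']
print(from_df2)
```

The first flagged row belongs to df1 (its own repeated Saumya/Delhi pair), so the result mixes both frames. Filtering on the outer level helps here, but `duplicated` would also flag a df2 row that merely repeats an earlier df2 row, so this is not a general solution.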
In [65]: (df2.merge(df1[['Name', 'City']].drop_duplicates(), how='left', indicator='Check')
    ...:     .assign(Check=lambda x: np.where(x['Check'] == 'both', 'Duplicated', None)))  # needs: import numpy as np
Out[65]:
      Name  Score    City       Check
0   Saumya     32   Delhi  Duplicated
1   Saumya     32  Mumbai        None
2  Aaditya     40  Mumbai  Duplicated
3    Seema     32   Delhi        None
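The one-liner above can be unpacked into separate steps, which may make it easier to see what `drop_duplicates` and `indicator` contribute. This is only a sketch of the same idea; the intermediate names `keys`, `naive`, `lookup`, `merged` and `result` are not from the answer, and `on=keys` is spelled out here although the original relies on the default of merging on the shared columns.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([('Stuti', 28, 'Varanasi'),
                    ('Saumya', 32, 'Delhi'),
                    ('Aaditya', 25, 'Mumbai'),
                    ('Saumya', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])
df2 = pd.DataFrame([('Saumya', 32, 'Delhi'),
                    ('Saumya', 32, 'Mumbai'),
                    ('Aaditya', 40, 'Mumbai'),
                    ('Seema', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])

keys = ['Name', 'City']

# Without drop_duplicates, the repeated ('Saumya', 'Delhi') pair in df1
# matches df2's first row twice, so the merge returns an extra row.
naive = df2.merge(df1[keys], on=keys, how='left')
print(len(naive), 'rows without drop_duplicates')

# One row per distinct (Name, City) pair keeps the left merge one-to-one
lookup = df1[keys].drop_duplicates()

# indicator='Check' adds a column saying where each row was found:
# 'both' if the pair exists in df1, 'left_only' otherwise
merged = df2.merge(lookup, on=keys, how='left', indicator='Check')
print(merged)

# Translate the indicator values into the labels asked for
result = merged.assign(
    Check=np.where(merged['Check'] == 'both', 'Duplicated', None))
print(result)
```

Because the merge uses `how='left'`, the rows keep df2's order, so the new column lines up with the original frame.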
You can convert both column pairs to a MultiIndex and compare them pairwise with Index.isin:
import numpy as np

# True where the (Name, City) pair of a df2 row also occurs in df1
m = df2.set_index(['Name', 'City']).index.isin(df1.set_index(['Name', 'City']).index)
df2['Check'] = np.where(m, 'Duplicated', None)
print(df2)
      Name  Score    City       Check
0   Saumya     32   Delhi  Duplicated
1   Saumya     32  Mumbai        None
2  Aaditya     40  Mumbai  Duplicated
3    Seema     32   Delhi        None
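Since the question asks for a mask to filter df2, the boolean array can also be used directly, without adding a column. A minimal sketch, assuming a hypothetical helper `pair_mask` (not part of pandas) that builds the two indexes with `pd.MultiIndex.from_frame` instead of `set_index`:

```python
import pandas as pd

df1 = pd.DataFrame([('Stuti', 28, 'Varanasi'),
                    ('Saumya', 32, 'Delhi'),
                    ('Aaditya', 25, 'Mumbai'),
                    ('Saumya', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])
df2 = pd.DataFrame([('Saumya', 32, 'Delhi'),
                    ('Saumya', 32, 'Mumbai'),
                    ('Aaditya', 40, 'Mumbai'),
                    ('Seema', 32, 'Delhi')],
                   columns=['Name', 'Score', 'City'])


def pair_mask(left, right, keys):
    """Hypothetical helper: boolean array, True where a row of `left`
    has a `keys` combination that also occurs in `right`."""
    left_pairs = pd.MultiIndex.from_frame(left[keys])
    right_pairs = pd.MultiIndex.from_frame(right[keys])
    return left_pairs.isin(right_pairs)


m = pair_mask(df2, df1, ['Name', 'City'])

# Use the mask to split df2 instead of labelling it
matched = df2[m]      # rows whose (Name, City) pair exists in df1
unmatched = df2[~m]   # rows whose pair does not exist in df1
print(matched)
print(unmatched)
```

Repeated pairs in df1 need no special handling here, because `isin` only tests membership and never adds rows.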