Identify duplicated rows with different value in another column pandas dataframe
Suppose I have a dataframe of names and countries:
ID FirstName LastName Country
1 Paulo Cortez Brasil
2 Paulo Cortez Brasil
3 Paulo Cortez Espanha
4 Maria Lurdes Espanha
5 Maria Lurdes Espanha
6 John Page USA
7 Felipe Cardoso Brasil
8 John Page USA
9 Felipe Cardoso Espanha
10 Steve Xis UK
I need a way to identify all people whose first name and last name appear more than once in the dataframe, where at least one of those records belongs to a different country, and to return all of the duplicated rows. The result should be this dataframe:
ID FirstName LastName Country
1 Paulo Cortez Brasil
2 Paulo Cortez Brasil
3 Paulo Cortez Espanha
7 Felipe Cardoso Brasil
9 Felipe Cardoso Espanha
What would be the best way to achieve this?
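To run the answers below, the sample data from the question can be rebuilt as a DataFrame. This is only a minimal sketch: it assumes ID is an ordinary column rather than the index, which the question does not state.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: ID is a regular column, not the index.
df = pd.DataFrame({
    'ID': list(range(1, 11)),
    'FirstName': ['Paulo', 'Paulo', 'Paulo', 'Maria', 'Maria',
                  'John', 'Felipe', 'John', 'Felipe', 'Steve'],
    'LastName': ['Cortez', 'Cortez', 'Cortez', 'Lurdes', 'Lurdes',
                 'Page', 'Cardoso', 'Page', 'Cardoso', 'Xis'],
    'Country': ['Brasil', 'Brasil', 'Espanha', 'Espanha', 'Espanha',
                'USA', 'Brasil', 'USA', 'Espanha', 'UK'],
})
print(df)
```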
A possible solution, based on DataFrameGroupBy.filter:
(df.groupby(['FirstName', 'LastName'])
.filter(lambda x: x['Country'].nunique() > 1)
.reset_index(drop=True))
Output:
ID FirstName LastName Country
0 1 Paulo Cortez Brasil
1 2 Paulo Cortez Brasil
2 3 Paulo Cortez Espanha
3 7 Felipe Cardoso Brasil
4 9 Felipe Cardoso Espanha
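To see why only these two names survive the filter, it may help to look at the number of distinct countries per name first. The sketch below rebuilds the question's sample data (assuming ID is a regular column) and prints that per-group count before applying the same filter.

```python
import pandas as pd

# Sample data from the question; ID as a regular column is an assumption.
df = pd.DataFrame({
    'ID': list(range(1, 11)),
    'FirstName': ['Paulo', 'Paulo', 'Paulo', 'Maria', 'Maria',
                  'John', 'Felipe', 'John', 'Felipe', 'Steve'],
    'LastName': ['Cortez', 'Cortez', 'Cortez', 'Lurdes', 'Lurdes',
                 'Page', 'Cardoso', 'Page', 'Cardoso', 'Xis'],
    'Country': ['Brasil', 'Brasil', 'Espanha', 'Espanha', 'Espanha',
                'USA', 'Brasil', 'USA', 'Espanha', 'UK'],
})

# Number of distinct countries for each (FirstName, LastName) pair.
per_name = df.groupby(['FirstName', 'LastName'])['Country'].nunique()
print(per_name)

# The filter keeps every row of a group whose distinct-country count exceeds 1.
out = (df.groupby(['FirstName', 'LastName'])
         .filter(lambda x: x['Country'].nunique() > 1)
         .reset_index(drop=True))
print(out['ID'].tolist())  # [1, 2, 3, 7, 9]
```

Maria Lurdes and John Page appear twice but always with the same country, so their count is 1 and the filter drops them.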
Use boolean indexing:
# is the name present in several countries?
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)
out = df.loc[m]
Output:
ID FirstName LastName Country
0 1 Paulo Cortez Brasil
1 2 Paulo Cortez Brasil
2 3 Paulo Cortez Espanha
6 7 Felipe Cardoso Brasil
8 9 Felipe Cardoso Espanha
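The index in this output (0, 1, 2, 6, 8) is the original one, because boolean indexing does not renumber rows. As a rough check, the sketch below rebuilds the sample data (assuming ID is a regular column) and compares the mask-based result with a groupby filter on the same condition.

```python
import pandas as pd

# Sample data from the question; ID as a regular column is an assumption.
df = pd.DataFrame({
    'ID': list(range(1, 11)),
    'FirstName': ['Paulo', 'Paulo', 'Paulo', 'Maria', 'Maria',
                  'John', 'Felipe', 'John', 'Felipe', 'Steve'],
    'LastName': ['Cortez', 'Cortez', 'Cortez', 'Lurdes', 'Lurdes',
                 'Page', 'Cardoso', 'Page', 'Cardoso', 'Xis'],
    'Country': ['Brasil', 'Brasil', 'Espanha', 'Espanha', 'Espanha',
                'USA', 'Brasil', 'USA', 'Espanha', 'UK'],
})

# Boolean mask: True where the name occurs with more than one country.
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)
by_mask = df.loc[m]

# Same condition expressed as a group filter.
by_filter = (df.groupby(['FirstName', 'LastName'])
               .filter(lambda x: x['Country'].nunique() > 1))

print(by_mask.index.tolist())  # [0, 1, 2, 6, 8]

# Compare after resetting the index so only the row contents matter.
same = by_mask.reset_index(drop=True).equals(by_filter.reset_index(drop=True))
print(same)
```

Calling reset_index(drop=True) on the result gives a fresh 0..n-1 index if the original row labels are not needed.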
First drop duplicate name and country combinations, keeping the original df untouched for the final merge. A plain df.drop_duplicates() does nothing here if ID is a column, because every row then differs in its ID:
unique = df.drop_duplicates(subset=['FirstName', 'LastName', 'Country'])
Group by FirstName and LastName to count the number of distinct countries each first and last name pair is associated with:
new_df = unique.groupby(['FirstName', 'LastName']).size().reset_index(name='counts')
Then keep only rows for which counts is larger than 1:
new_df = new_df[new_df.counts > 1]
You can then merge your initial df with new_df on FirstName and LastName:
pd.merge(df, new_df, on=['FirstName', 'LastName'])
With the sample data (assuming ID is a regular column), counts is the number of distinct countries per name, and this returns:
ID FirstName LastName Country counts
0 1 Paulo Cortez Brasil 2
1 2 Paulo Cortez Brasil 2
2 3 Paulo Cortez Espanha 2
3 7 Felipe Cardoso Brasil 2
4 9 Felipe Cardoso Espanha 2