Identify duplicated rows that have different values in another column (pandas dataframe)

Question

Suppose I have a dataframe of names and countries:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
4   Maria       Lurdes      Espanha
5   Maria       Lurdes      Espanha
6   John        Page        USA
7   Felipe      Cardoso     Brasil
8   John        Page        USA
9   Felipe      Cardoso     Espanha
10  Steve       Xis         UK
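
For anyone who wants to run the answers, the table above can be rebuilt as a DataFrame. This is only a sketch: the variable name `df` is an assumption chosen to match the name used in the answers.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                  "John", "Felipe", "John", "Felipe", "Steve"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                 "Page", "Cardoso", "Page", "Cardoso", "Xis"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                "USA", "Brasil", "USA", "Espanha", "UK"],
})

print(df)
```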

I need a way to identify every person whose first name and last name appear more than once in the dataframe, where at least one of those records belongs to a different country, and to return all of that person's rows. The result should be this dataframe:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
7   Felipe      Cardoso     Brasil
9   Felipe      Cardoso     Espanha

What would be the best way to achieve this?

Answer 1

A possible solution, based on DataFrameGroupBy.filter:

(df.groupby(['FirstName', 'LastName'])
 .filter(lambda x: x['Country'].nunique() > 1)
 .reset_index(drop=True))

Output:

   ID FirstName LastName  Country
0   1     Paulo   Cortez   Brasil
1   2     Paulo   Cortez   Brasil
2   3     Paulo   Cortez  Espanha
3   7    Felipe  Cardoso   Brasil
4   9    Felipe  Cardoso  Espanha
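
To see why only two names pass the filter, it can help to look at the number of distinct countries per name. The following is a minimal sketch, assuming the sample data from the question (rebuilt here so the block runs on its own); the names `per_name` and `conflicting` are illustrative only.

```python
import pandas as pd

# Sample data from the question, rebuilt so this block is self-contained.
rows = [
    (1, "Paulo", "Cortez", "Brasil"), (2, "Paulo", "Cortez", "Brasil"),
    (3, "Paulo", "Cortez", "Espanha"), (4, "Maria", "Lurdes", "Espanha"),
    (5, "Maria", "Lurdes", "Espanha"), (6, "John", "Page", "USA"),
    (7, "Felipe", "Cardoso", "Brasil"), (8, "John", "Page", "USA"),
    (9, "Felipe", "Cardoso", "Espanha"), (10, "Steve", "Xis", "UK"),
]
df = pd.DataFrame(rows, columns=["ID", "FirstName", "LastName", "Country"])

# Number of distinct countries for each (FirstName, LastName) pair.
per_name = df.groupby(["FirstName", "LastName"])["Country"].nunique()

# Only pairs seen in more than one country satisfy the filter condition.
conflicting = per_name[per_name > 1]
print(conflicting)
```

Maria Lurdes and John Page each appear twice, but always with the same country, so their count is 1 and the filter drops them.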

Answer 2

Use boolean indexing:

# is the name present in several countries?
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)

out = df.loc[m]

Output:

   ID FirstName LastName  Country
0   1     Paulo   Cortez   Brasil
1   2     Paulo   Cortez   Brasil
2   3     Paulo   Cortez  Espanha
6   7    Felipe  Cardoso   Brasil
8   9    Felipe  Cardoso  Espanha
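
If the same check is needed for other columns, the mask can be wrapped in a small helper. This is a sketch under assumptions: the function name `rows_with_conflicting_values` and its parameters are hypothetical and not part of pandas. Note that `nunique` skips missing values by default, so a missing country does not count as a different one.

```python
import pandas as pd


def rows_with_conflicting_values(frame, keys, value):
    """Return the rows whose `keys` group holds more than one distinct `value`.

    Hypothetical helper for illustration; the original row index is kept.
    """
    # transform broadcasts the per-group count back onto every row,
    # so the mask lines up with the index of `frame`.
    mask = frame.groupby(keys)[value].transform("nunique").gt(1)
    return frame.loc[mask]


# Sample data from the question, rebuilt so this block is self-contained.
rows = [
    (1, "Paulo", "Cortez", "Brasil"), (2, "Paulo", "Cortez", "Brasil"),
    (3, "Paulo", "Cortez", "Espanha"), (4, "Maria", "Lurdes", "Espanha"),
    (5, "Maria", "Lurdes", "Espanha"), (6, "John", "Page", "USA"),
    (7, "Felipe", "Cardoso", "Brasil"), (8, "John", "Page", "USA"),
    (9, "Felipe", "Cardoso", "Espanha"), (10, "Steve", "Xis", "UK"),
]
df = pd.DataFrame(rows, columns=["ID", "FirstName", "LastName", "Country"])

result = rows_with_conflicting_values(df, ["FirstName", "LastName"], "Country")
print(result["ID"].tolist())  # [1, 2, 3, 7, 9]
```

Because the rows are selected with a mask rather than rebuilt, the original index labels are kept; chain `.reset_index(drop=True)` if a fresh index is preferred.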

Answer 3

First drop duplicate name and country combinations from your pandas dataframe, keeping the result in a separate dataframe so that the original rows stay available for the final merge. The ID column has to be left out of the comparison, otherwise no row counts as a duplicate:

unique_pairs = df.drop_duplicates(subset=['FirstName', 'LastName', 'Country'])

Group by FirstName and LastName to count the number of distinct countries associated with each first and last name pair:

new_df = unique_pairs.groupby(['FirstName', 'LastName']).size().reset_index(name='counts')

Then keep only the rows for which counts is larger than 1:

new_df = new_df[new_df.counts > 1]

You can then merge your initial df with new_df on FirstName and LastName:

pd.merge(df, new_df, on=['FirstName', 'LastName'])

This returns:

   ID FirstName LastName  Country  counts
0   1     Paulo   Cortez   Brasil       2
1   2     Paulo   Cortez   Brasil       2
2   3     Paulo   Cortez  Espanha       2
3   7    Felipe  Cardoso   Brasil       2
4   9    Felipe  Cardoso  Espanha       2
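
As a quick sanity check, the three approaches can be run side by side on the sample data. This is a sketch, not a definitive implementation: the variable names are illustrative, and the merge variant here assumes that duplicates are dropped on name and country only, into a separate frame, so that the ID column does not hide them.

```python
import pandas as pd

# Sample data from the question, rebuilt so this block is self-contained.
rows = [
    (1, "Paulo", "Cortez", "Brasil"), (2, "Paulo", "Cortez", "Brasil"),
    (3, "Paulo", "Cortez", "Espanha"), (4, "Maria", "Lurdes", "Espanha"),
    (5, "Maria", "Lurdes", "Espanha"), (6, "John", "Page", "USA"),
    (7, "Felipe", "Cardoso", "Brasil"), (8, "John", "Page", "USA"),
    (9, "Felipe", "Cardoso", "Espanha"), (10, "Steve", "Xis", "UK"),
]
df = pd.DataFrame(rows, columns=["ID", "FirstName", "LastName", "Country"])
keys = ["FirstName", "LastName"]

# Approach 1: keep whole groups that contain more than one country.
by_filter = df.groupby(keys).filter(lambda g: g["Country"].nunique() > 1)

# Approach 2: boolean mask built with transform.
by_mask = df.loc[df.groupby(keys)["Country"].transform("nunique").gt(1)]

# Approach 3: count distinct (name, country) combinations, then merge back.
pairs = df.drop_duplicates(subset=keys + ["Country"])
counts = pairs.groupby(keys).size().reset_index(name="counts")
by_merge = pd.merge(df, counts[counts["counts"] > 1], on=keys)

ids = [frame["ID"].tolist() for frame in (by_filter, by_mask, by_merge)]
print(ids)  # each list is [1, 2, 3, 7, 9]
```

All three select the same rows; they differ in the index they return and in whether an extra counts column is attached.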

3 answers

Answer 1
1 2022-12-12 19:22:47

Answer 2
1 Accepted 2022-12-12 19:29:19

Answer 3
0 2022-12-12 19:30:19
