
Identify duplicated rows with different value in another column pandas dataframe

Suppose I have a dataframe of names and countries:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
4   Maria       Lurdes      Espanha
5   Maria       Lurdes      Espanha
6   John        Page        USA
7   Felipe      Cardoso     Brasil
8   John        Page        USA
9   Felipe      Cardoso     Espanha
10  Steve       Xis         UK
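
For anyone who wants to run the answers below, the table above can be rebuilt roughly as follows. This is only a convenience sketch: the variable name df matches what the answers use, and the values are copied from the table.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame(
    {
        "ID": list(range(1, 11)),
        "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                      "John", "Felipe", "John", "Felipe", "Steve"],
        "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                     "Page", "Cardoso", "Page", "Cardoso", "Xis"],
        "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                    "USA", "Brasil", "USA", "Espanha", "UK"],
    }
)
print(df)
```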

I need a way to identify every person whose first name and last name appear more than once in the dataframe, where at least one of those records belongs to a different country, and to return all of that person's rows. The result should be this dataframe:

ID  FirstName   LastName    Country
1   Paulo       Cortez      Brasil
2   Paulo       Cortez      Brasil
3   Paulo       Cortez      Espanha
7   Felipe      Cardoso     Brasil
9   Felipe      Cardoso     Espanha

What would be the best way to achieve it?

A possible solution, based on DataFrameGroupBy.filter:

(df.groupby(['FirstName', 'LastName'])
 .filter(lambda x: x['Country'].nunique() > 1)
 .reset_index(drop=True))

Output:

   ID FirstName LastName  Country
0   1     Paulo   Cortez   Brasil
1   2     Paulo   Cortez   Brasil
2   3     Paulo   Cortez  Espanha
3   7    Felipe  Cardoso   Brasil
4   9    Felipe  Cardoso  Espanha
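
To see why filter keeps exactly these rows, it can help to look at the number of distinct countries per name first. The sketch below rebuilds the sample data and is self-contained; the names rows, per_name and result are illustrative and not part of the answer.

```python
import pandas as pd

rows = [
    (1, "Paulo", "Cortez", "Brasil"), (2, "Paulo", "Cortez", "Brasil"),
    (3, "Paulo", "Cortez", "Espanha"), (4, "Maria", "Lurdes", "Espanha"),
    (5, "Maria", "Lurdes", "Espanha"), (6, "John", "Page", "USA"),
    (7, "Felipe", "Cardoso", "Brasil"), (8, "John", "Page", "USA"),
    (9, "Felipe", "Cardoso", "Espanha"), (10, "Steve", "Xis", "UK"),
]
df = pd.DataFrame(rows, columns=["ID", "FirstName", "LastName", "Country"])

# Distinct countries for each (FirstName, LastName) pair.
per_name = df.groupby(["FirstName", "LastName"])["Country"].nunique()
print(per_name)

# filter() calls the lambda once per group and keeps every row of the
# groups for which it returns True, here the names seen in 2+ countries.
result = df.groupby(["FirstName", "LastName"]).filter(
    lambda g: g["Country"].nunique() > 1
)
print(result["ID"].tolist())
```

Only Paulo Cortez and Felipe Cardoso have more than one distinct country, so only their rows pass the filter. Maria Lurdes and John Page are duplicated too, but always with the same country.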

Use boolean indexing:

# is the name present in several countries?
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)

out = df.loc[m]

Output:

   ID FirstName LastName  Country
0   1     Paulo   Cortez   Brasil
1   2     Paulo   Cortez   Brasil
2   3     Paulo   Cortez  Espanha
6   7    Felipe  Cardoso   Brasil
8   9    Felipe  Cardoso  Espanha
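
A self-contained sketch of the same idea, showing the intermediate mask. transform('nunique') broadcasts each group's distinct-country count back onto every row of that group, so the mask lines up with df row by row and the original index labels (0, 1, 2, 6, 8) are kept. The name rows is illustrative only.

```python
import pandas as pd

rows = [
    (1, "Paulo", "Cortez", "Brasil"), (2, "Paulo", "Cortez", "Brasil"),
    (3, "Paulo", "Cortez", "Espanha"), (4, "Maria", "Lurdes", "Espanha"),
    (5, "Maria", "Lurdes", "Espanha"), (6, "John", "Page", "USA"),
    (7, "Felipe", "Cardoso", "Brasil"), (8, "John", "Page", "USA"),
    (9, "Felipe", "Cardoso", "Espanha"), (10, "Steve", "Xis", "UK"),
]
df = pd.DataFrame(rows, columns=["ID", "FirstName", "LastName", "Country"])

# One boolean per row: does this row's name occur with 2+ countries?
m = df.groupby(["FirstName", "LastName"])["Country"].transform("nunique").gt(1)
print(m.tolist())

# Boolean indexing keeps the original index labels.
out = df.loc[m]
print(out.index.tolist())
```

Because the mask is an ordinary boolean Series, it can also be combined with other conditions using & and | before indexing.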

First drop duplicate combinations of FirstName, LastName and Country, so that each name keeps one row per distinct country. Restrict the comparison to those three columns: a plain df.drop_duplicates() would remove nothing here, because the ID column makes every row unique.

unique_df = df.drop_duplicates(subset=['FirstName', 'LastName', 'Country'])

Group by FirstName and LastName to count the number of distinct countries each first and last name pair is associated with:

new_df = unique_df.groupby(['FirstName', 'LastName']).size().reset_index(name='counts')

Then keep only the names for which the count is larger than 1:

new_df = new_df[new_df.counts > 1]

You can then merge your initial df with new_df on FirstName and LastName:

pd.merge(df, new_df, on=['FirstName', 'LastName'])

With the sample data this returns:

   ID FirstName LastName  Country  counts
0   1     Paulo   Cortez   Brasil       2
1   2     Paulo   Cortez   Brasil       2
2   3     Paulo   Cortez  Espanha       2
3   7    Felipe  Cardoso   Brasil       2
4   9    Felipe  Cardoso  Espanha       2
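
A self-contained sketch of the count-and-merge approach on the sample data, contrasted with what happens if whole rows are deduplicated. It assumes deduplication on the name and country columns only, since the ID column makes every full row unique; the names rows, keys, naive, pairs, counts and result are illustrative.

```python
import pandas as pd

rows = [
    (1, "Paulo", "Cortez", "Brasil"), (2, "Paulo", "Cortez", "Brasil"),
    (3, "Paulo", "Cortez", "Espanha"), (4, "Maria", "Lurdes", "Espanha"),
    (5, "Maria", "Lurdes", "Espanha"), (6, "John", "Page", "USA"),
    (7, "Felipe", "Cardoso", "Brasil"), (8, "John", "Page", "USA"),
    (9, "Felipe", "Cardoso", "Espanha"), (10, "Steve", "Xis", "UK"),
]
df = pd.DataFrame(rows, columns=["ID", "FirstName", "LastName", "Country"])
keys = ["FirstName", "LastName"]

# Pitfall: with a unique ID column, whole-row deduplication drops nothing,
# so counting rows per name also flags same-country duplicates.
naive = df.drop_duplicates().groupby(keys).size()
print(naive[naive > 1].index.tolist())

# Deduplicate on name + country, then count rows per name: the count is
# now the number of distinct countries for that name.
pairs = df.drop_duplicates(subset=keys + ["Country"])
counts = pairs.groupby(keys).size().reset_index(name="counts")
counts = counts[counts["counts"] > 1]

# The default inner merge keeps only rows of df whose name is in counts.
result = pd.merge(df, counts, on=keys)
print(result)
```

The naive count also reports Maria Lurdes and John Page, who are duplicated within a single country. Restricting the deduplication to the name and country columns leaves only Paulo Cortez and Felipe Cardoso.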
