Python Pandas Identify Duplicated rows with Additional Column
Identify duplicated rows with different value in another column pandas dataframe
Suppose I have a dataframe of names and countries:
ID FirstName LastName Country
1 Paulo Cortez Brasil
2 Paulo Cortez Brasil
3 Paulo Cortez Espanha
4 Maria Lurdes Espanha
5 Maria Lurdes Espanha
6 John Page USA
7 Felipe Cardoso Brasil
8 John Page USA
9 Felipe Cardoso Espanha
10 Steve Xis UK
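For anyone who wants to reproduce this, the table above can be built roughly as follows. This is a minimal sketch that assumes ID is an ordinary column rather than the index, and that every column other than ID holds plain strings:

```python
import pandas as pd

# Sample data copied from the table above; ID is assumed to be a regular column.
df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                  "John", "Felipe", "John", "Felipe", "Steve"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                 "Page", "Cardoso", "Page", "Cardoso", "Xis"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                "USA", "Brasil", "USA", "Espanha", "UK"],
})

print(df)
```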
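I need a way to identify everyone who shares the same first and last name, appears more than once in the dataframe, and has at least one record that appears to belong to a different country, and to return all of their duplicated rows. That would produce this dataframe: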
ID FirstName LastName Country
1 Paulo Cortez Brasil
2 Paulo Cortez Brasil
3 Paulo Cortez Espanha
7 Felipe Cardoso Brasil
9 Felipe Cardoso Espanha
What is the best way to achieve this?
One possible solution, based on DataFrameGroupBy.filter:
(df.groupby(['FirstName', 'LastName'])
.filter(lambda x: x['Country'].nunique() > 1)
.reset_index(drop=True))
Output:
ID FirstName LastName Country
0 1 Paulo Cortez Brasil
1 2 Paulo Cortez Brasil
2 3 Paulo Cortez Espanha
3 7 Felipe Cardoso Brasil
4 9 Felipe Cardoso Espanha
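The output above is renumbered 0 to 4 only because of reset_index(drop=True). As a rough sketch, assuming the sample data from the question with ID as a regular column, leaving that call out should keep the original row labels, which can be handy if the result needs to be matched back to df later:

```python
import pandas as pd

# Sample data from the question (ID assumed to be a regular column).
df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                  "John", "Felipe", "John", "Felipe", "Steve"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                 "Page", "Cardoso", "Page", "Cardoso", "Xis"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                "USA", "Brasil", "USA", "Espanha", "UK"],
})

# Keep every group (same first and last name) that spans more than one country.
# Without reset_index, the surviving rows keep their original index labels.
kept = df.groupby(["FirstName", "LastName"]).filter(
    lambda g: g["Country"].nunique() > 1
)

print(kept)
```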
Using boolean indexing:
# is the name present in several countries?
m = df.groupby(['FirstName', 'LastName'])['Country'].transform('nunique').gt(1)
out = df.loc[m]
Output:
ID FirstName LastName Country
0 1 Paulo Cortez Brasil
1 2 Paulo Cortez Brasil
2 3 Paulo Cortez Espanha
6 7 Felipe Cardoso Brasil
8 9 Felipe Cardoso Espanha
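To see why the mask works, it may help to look at the intermediate Series. In this sketch (again assuming the question's sample data, with ID as a regular column), transform('nunique') broadcasts each group's distinct-country count back onto every row of that group, so the result lines up with df row by row:

```python
import pandas as pd

# Sample data from the question (ID assumed to be a regular column).
df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                  "John", "Felipe", "John", "Felipe", "Steve"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                 "Page", "Cardoso", "Page", "Cardoso", "Xis"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                "USA", "Brasil", "USA", "Espanha", "UK"],
})

# One value per original row: the number of distinct countries seen
# for that row's (FirstName, LastName) pair.
n_countries = df.groupby(["FirstName", "LastName"])["Country"].transform("nunique")

# Boolean mask: True where the name occurs in more than one country.
m = n_countries.gt(1)
out = df.loc[m]

print(df.assign(n_countries=n_countries, keep=m))
```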
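First, build a de-duplicated copy of the name and country columns, leaving the original df untouched (because the ID column makes every row unique, calling drop_duplicates() on the whole dataframe would remove nothing):
pairs = df[['FirstName', 'LastName', 'Country']].drop_duplicates()
Group by FirstName and LastName to count how many different countries each first-name/last-name pair is associated with:
new_df = pairs.groupby(['FirstName', 'LastName']).size().reset_index(name='counts')
Then keep only the rows whose count is greater than 1:
new_df = new_df[new_df.counts > 1]
You can then merge the initial df with new_df on FirstName and LastName:
pd.merge(df, new_df, on=['FirstName', 'LastName'])
This returns:
ID FirstName LastName Country counts
0 1 Paulo Cortez Brasil 2
1 2 Paulo Cortez Brasil 2
2 3 Paulo Cortez Espanha 2
3 7 Felipe Cardoso Brasil 2
4 9 Felipe Cardoso Espanha 2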
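A variant of the same merge idea, offered only as a sketch: counting distinct countries directly with nunique avoids the separate de-duplication step. It assumes the question's sample data with ID as a regular column; the counts column name is reused from the answer above.

```python
import pandas as pd

# Sample data from the question (ID assumed to be a regular column).
df = pd.DataFrame({
    "ID": list(range(1, 11)),
    "FirstName": ["Paulo", "Paulo", "Paulo", "Maria", "Maria",
                  "John", "Felipe", "John", "Felipe", "Steve"],
    "LastName": ["Cortez", "Cortez", "Cortez", "Lurdes", "Lurdes",
                 "Page", "Cardoso", "Page", "Cardoso", "Xis"],
    "Country": ["Brasil", "Brasil", "Espanha", "Espanha", "Espanha",
                "USA", "Brasil", "USA", "Espanha", "UK"],
})

# Number of distinct countries per (FirstName, LastName) pair.
counts = (
    df.groupby(["FirstName", "LastName"])["Country"]
    .nunique()
    .reset_index(name="counts")
)

# Keep only the names that occur in more than one country.
counts = counts[counts["counts"] > 1]

# The default inner merge drops every row whose name is not in counts.
out = pd.merge(df, counts, on=["FirstName", "LastName"])

print(out)
```

Here counts holds the number of distinct countries per name, not the number of rows, so Paulo Cortez gets 2 even though he has three records.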