Pandas: How to drop column values that are duplicates but keep certain row values
Remove duplicates and keep row that certain column is Yes in a pandas dataframe
I have a dataframe with duplicate values in the "ID" column, like this:
ID Name Street Birth Job Primary?
1 Fake1 Street1 2000-01-01 Job1 Yes
2 Fake2 Street2 2000-01-02 Job2 No
3 Fake3 Street3 2000-01-03 Job3 Yes
1 Fake1 Street1 2000-01-01 Job4 No
2 Fake2 Street2 2000-01-02 Job5 Yes
4 Fake4 Street4 2000-01-03 Job6 Yes
1 Fake1 Street1 2000-01-01 Job7 No
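I need a way to remove the duplicates (by "ID") but keep the rows where the "Primary?" column is "Yes". Every unique ID has "Yes" in that column; every duplicated ID has exactly one record with "Yes" and all its other records have "No". The result should be this dataframe: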
ID Name Street Birth Job Primary?
1 Fake1 Street1 2000-01-01 Job1 Yes
3 Fake3 Street3 2000-01-03 Job3 Yes
2 Fake2 Street2 2000-01-02 Job5 Yes
4 Fake4 Street4 2000-01-03 Job6 Yes
What is the best way to do this?
Thanks!
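Use DataFrame.sort_values on the Primary? column so that the Yes rows come last for each ID. Then use DataFrame.drop_duplicates with keep='last', so the Primary?=Yes row is kept for each ID and the Primary?=No rows are dropped: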
df = df.sort_values('Primary?').drop_duplicates('ID', keep='last')
print(df)
ID Name Street Birth Job Primary?
0 1 Fake1 Street1 2000-01-01 Job1 Yes
2 3 Fake3 Street3 2000-01-03 Job3 Yes
4 2 Fake2 Street2 2000-01-02 Job5 Yes
5 4 Fake4 Street4 2000-01-03 Job6 Yes
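The sort-then-deduplicate idea can be checked on a small reproduction of the question's data. This is a minimal sketch, not the answerer's exact code: the Street and Birth columns are left out for brevity, the names `ordered` and `deduped` are illustrative, and `kind="mergesort"` is an added assumption so that the sort is stable and rows with the same Primary? value keep their original relative order.

```python
import pandas as pd

# Sample data from the question (Street and Birth omitted for brevity).
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 4, 1],
    "Name": ["Fake1", "Fake2", "Fake3", "Fake1", "Fake2", "Fake4", "Fake1"],
    "Job": ["Job1", "Job2", "Job3", "Job4", "Job5", "Job6", "Job7"],
    "Primary?": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# "No" sorts before "Yes", so every "Yes" row ends up after every "No" row.
# mergesort is stable: ties keep their original relative order.
ordered = df.sort_values("Primary?", kind="mergesort")

# Keep the last row per ID, which is the "Yes" row whenever one exists.
deduped = ordered.drop_duplicates("ID", keep="last")

# Optional: restore the original row order of the surviving rows.
deduped = deduped.sort_index()
print(deduped)
```

Note that this relies on the string ordering "No" < "Yes"; with other labels the sort direction or `keep` argument may need to be adjusted.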
Use groupby.idxmax on a boolean Series derived from the "Primary?" column:
out = df.loc[df['Primary?'].eq('Yes').groupby(df['ID']).idxmax()]
output:
ID Name Street Birth Job Primary?
0 1 Fake1 Street1 2000-01-01 Job1 Yes
4 2 Fake2 Street2 2000-01-02 Job5 Yes
2 3 Fake3 Street3 2000-01-03 Job3 Yes
5 4 Fake4 Street4 2000-01-03 Job6 Yes
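Splitting the one-liner into steps may make it easier to see what idxmax is doing. The sketch below rebuilds the question's data (Street and Birth omitted) and uses the illustrative names `is_primary` and `primary_idx`, which are not part of the original answer.

```python
import pandas as pd

# Sample data from the question (Street and Birth omitted for brevity).
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 4, 1],
    "Name": ["Fake1", "Fake2", "Fake3", "Fake1", "Fake2", "Fake4", "Fake1"],
    "Job": ["Job1", "Job2", "Job3", "Job4", "Job5", "Job6", "Job7"],
    "Primary?": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# True where the row is the primary record for its ID.
is_primary = df["Primary?"].eq("Yes")

# Per ID, idxmax returns the index label of the first maximum,
# i.e. the first True row in that group.
primary_idx = is_primary.groupby(df["ID"]).idxmax()

# Select those rows; the result is ordered by ID, not by original position.
out = df.loc[primary_idx]
print(out)
```

One caveat worth keeping in mind: if some ID had no "Yes" row at all, every value in its group would be False and idxmax would return the first row of that group, so a "No" row would silently be kept. The question states this cannot happen in its data.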
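# Keep only the primary rows; the comparison is case-sensitive, so match 'Yes' exactly
df = df[df['Primary?'] == 'Yes']
# Guard against an ID that has more than one 'Yes' row
df = df.drop_duplicates(subset=['ID'])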
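The filter approach depends on the guarantee stated in the question that every ID has exactly one "Yes" row. A minimal sketch of checking that assumption before filtering is shown below; the name `yes_per_id` and the assertion message are illustrative, and the data omits the Street and Birth columns.

```python
import pandas as pd

# Sample data from the question (Street and Birth omitted for brevity).
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 4, 1],
    "Name": ["Fake1", "Fake2", "Fake3", "Fake1", "Fake2", "Fake4", "Fake1"],
    "Job": ["Job1", "Job2", "Job3", "Job4", "Job5", "Job6", "Job7"],
    "Primary?": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Count the "Yes" rows per ID; the filter assumes exactly one for each.
yes_per_id = df["Primary?"].eq("Yes").groupby(df["ID"]).sum()
assert (yes_per_id == 1).all(), "an ID has zero or several primary rows"

# Under that assumption a plain boolean mask already yields one row per ID.
out = df[df["Primary?"] == "Yes"]
print(out)
```

If the check fails because some ID has no "Yes" row, the filter would drop that ID entirely, which is where the sort-based or idxmax-based approaches behave differently.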
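Notice: The technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.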