Pandas: How to drop column values that are duplicates but keep certain row values
Remove duplicates and keep row that certain column is Yes in a pandas dataframe
I have a dataframe with duplicate values in the "ID" column, as shown below:
ID Name Street Birth Job Primary?
1 Fake1 Street1 2000-01-01 Job1 Yes
2 Fake2 Street2 2000-01-02 Job2 No
3 Fake3 Street3 2000-01-03 Job3 Yes
1 Fake1 Street1 2000-01-01 Job4 No
2 Fake2 Street2 2000-01-02 Job5 Yes
4 Fake4 Street4 2000-01-03 Job6 Yes
1 Fake1 Street1 2000-01-01 Job7 No
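I need a way to drop the duplicates (by "ID") while keeping the rows where the "Primary?" column is "Yes". Every unique ID has "Yes" in that column; each duplicated ID has exactly one record with "Yes" and all of its other records are "No". The result should be this dataframe: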
ID Name Street Birth Job Primary?
1 Fake1 Street1 2000-01-01 Job1 Yes
3 Fake3 Street3 2000-01-03 Job3 Yes
2 Fake2 Street2 2000-01-02 Job5 Yes
4 Fake4 Street4 2000-01-03 Job6 Yes
What is the best way to do this?
Thanks!
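Use DataFrame.sort_values on "Primary?" so that, for each ID, the row with Primary?=Yes comes last ("No" sorts before "Yes"). Then use DataFrame.drop_duplicates with keep='last', which removes the Primary?=No rows: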
df = df.sort_values('Primary?').drop_duplicates('ID', keep='last')
print (df)
ID Name Street Birth Job Primary?
0 1 Fake1 Street1 2000-01-01 Job1 Yes
2 3 Fake3 Street3 2000-01-03 Job3 Yes
4 2 Fake2 Street2 2000-01-02 Job5 Yes
5 4 Fake4 Street4 2000-01-03 Job6 Yes
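For anyone who wants to run this, here is a minimal self-contained sketch of the same sort-then-drop idea. The DataFrame is rebuilt from the table in the question. The `kind="stable"` argument and the trailing `sort_index()` are additions that were not in the answer: they keep ties in their original order and put the surviving rows back in their original row order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 4, 1],
    "Name": ["Fake1", "Fake2", "Fake3", "Fake1", "Fake2", "Fake4", "Fake1"],
    "Street": ["Street1", "Street2", "Street3", "Street1", "Street2", "Street4", "Street1"],
    "Birth": ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-01",
              "2000-01-02", "2000-01-03", "2000-01-01"],
    "Job": ["Job1", "Job2", "Job3", "Job4", "Job5", "Job6", "Job7"],
    "Primary?": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# "No" sorts before "Yes", so after sorting the "Yes" row is the last one for each ID.
# keep="last" then retains exactly that row per ID.
result = (
    df.sort_values("Primary?", kind="stable")  # assumption: stable sort to keep ties in order
      .drop_duplicates("ID", keep="last")
      .sort_index()                            # optional: restore the original row order
)
print(result)
```

Note that this relies on the string ordering of "No" and "Yes"; with other labels the sort direction may need to change.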
Use groupby.idxmax on a boolean Series derived from the "Primary?" column:
out = df.loc[df['Primary?'].eq('Yes').groupby(df['ID']).idxmax()]
output:
ID Name Street Birth Job Primary?
0 1 Fake1 Street1 2000-01-01 Job1 Yes
4 2 Fake2 Street2 2000-01-02 Job5 Yes
2 3 Fake3 Street3 2000-01-03 Job3 Yes
5 4 Fake4 Street4 2000-01-03 Job6 Yes
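To make the one-liner easier to follow, the sketch below splits it into steps on a reduced copy of the sample data (only the "ID", "Job" and "Primary?" columns, which is a simplification for illustration). The names `is_primary` and `first_yes` are made up here.

```python
import pandas as pd

# Reduced version of the question's sample data (illustrative subset of columns).
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 4, 1],
    "Job": ["Job1", "Job2", "Job3", "Job4", "Job5", "Job6", "Job7"],
    "Primary?": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# Step 1: boolean Series, True where the row is the primary one.
is_primary = df["Primary?"].eq("Yes")

# Step 2: per ID, idxmax returns the index label of the first True.
# If a group had no True at all, it would return that group's first row instead.
first_yes = is_primary.groupby(df["ID"]).idxmax()
print(first_yes)

# Step 3: select those rows; the result is ordered by ID, not by original position.
out = df.loc[first_yes]
print(out)
```

Because the groups are sorted by key, the rows come back in ID order, which is why the output above lists index 4 before index 2.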
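df = df[df['Primary?'] == 'Yes']  # keep only the primary rows; the value in the data is capitalized 'Yes'
df = df.drop_duplicates(subset=['ID'])  # safety net in case an ID has more than one 'Yes' row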
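The filter-based approach only gives the desired result if the data really has one "Yes" row per ID, as the question states. A rough sketch of how that assumption could be checked before relying on it, again on a reduced copy of the sample data (the checks are illustrative additions, not part of the answer):

```python
import pandas as pd

# Reduced version of the question's sample data (illustrative subset of columns).
df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 4, 1],
    "Job": ["Job1", "Job2", "Job3", "Job4", "Job5", "Job6", "Job7"],
    "Primary?": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

primary = df[df["Primary?"] == "Yes"]

# Check 1: no ID has more than one "Yes" row.
no_duplicate_primaries = primary["ID"].is_unique

# Check 2: no ID is lost, i.e. every ID has at least one "Yes" row.
no_missing_ids = set(primary["ID"]) == set(df["ID"])

print(no_duplicate_primaries, no_missing_ids)
print(primary)
```

If both checks pass, the filter alone is enough and the drop_duplicates call changes nothing.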