pandas drop_duplicates with conditions on the values of two other columns
I have a dataframe with columns A, B and C.
Column A is where there are duplicates. Column B holds either an email value or NaN. Column C holds either the value 'wait' or a number.
My dataframe has duplicate values in A. For each value of A, I would like to keep the non-NaN value in B and the non-'wait' value in C (i.e. the number).
How could I do that on a dataframe df?
I have tried df.drop_duplicates('A'), but I don't see any way to put conditions on the other columns.
Edit: sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': ['a@b.com', np.nan, np.nan, 'c@d.com', 'np.nan', np.nan], 'C': [123, 456, 567, 'wait', 'wait', 'wait']})
>>> df
   A        B     C
0  1  a@b.com   123
1  1      NaN   456
2  2      NaN   567
3  2  c@d.com  wait
4  3   np.nan  wait
5  3      NaN  wait
I would like the resulting dataframe to be:
>>> df
   A        B     C
0  1  a@b.com   123
1  2  c@d.com   567
2  3   np.nan  wait
Thank you.
Solution: sort by columns A and C with a key that tests whether each value equals 'wait', so that rows with 'wait' come after the others, then group by column A and take the first non-missing value of each column, if one exists, per group:
df = df.sort_values(['A', 'C'], key=lambda x: x.eq('wait')).groupby('A').first()
print(df)
         B     C
A
1  a@b.com   123
2  c@d.com   567
3   np.nan  wait
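For reference, here is a minimal self-contained sketch of the same approach run against the sample data from the question. It is an illustration under a few assumptions rather than the only way to do this: the names by_row and result are introduced here, and the final reset_index() is added on the assumption that A should come back as an ordinary column, as in the desired output. It also shows why drop_duplicates('A') on its own is not enough: it keeps whole rows, whereas the desired row for A == 2 combines B from one row with C from another. Note that the key argument of sort_values requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Sample data from the question. Row 4 holds the *string* 'np.nan',
# not a missing value, so it counts as a regular non-null entry.
df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3],
    'B': ['a@b.com', np.nan, np.nan, 'c@d.com', 'np.nan', np.nan],
    'C': [123, 456, 567, 'wait', 'wait', 'wait'],
})

# drop_duplicates keeps the first whole row per value of A,
# so for A == 2 it keeps the row where B is NaN.
by_row = df.drop_duplicates('A')

# The key is applied to each sort column separately; only C can equal
# 'wait', so rows with 'wait' are moved after the rows with numbers.
# groupby(...).first() then takes the first non-missing value of each
# column within every group of A.
result = (
    df.sort_values(['A', 'C'], key=lambda x: x.eq('wait'))
      .groupby('A')
      .first()
      .reset_index()  # assumption: A is wanted as a column, not the index
)
print(result)
```

Because first() works column by column, B and C in a result row may come from different source rows, which is what the desired output for A == 2 asks for.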