Dropping selected rows in Pandas with duplicated columns
Suppose I have a dataframe like this:
fname lname email
Joe Aaron
Joe Aaron some@some.com
Bill Smith
Bill Smith
Bill Smith some2@some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?
You should first check whether your "empty" data is NaN or empty strings. If it is a mixture of both, you may need to modify the logic below.
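One way to run that check is sketched below. The sample frame is hypothetical: it deliberately mixes NaN and empty strings in the email column to show how the two are counted separately and then normalised to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one NaN and two empty strings in "email".
df = pd.DataFrame({
    "fname": ["Joe", "Joe", "Bill", "Bill", "Bill"],
    "lname": ["Aaron", "Aaron", "Smith", "Smith", "Smith"],
    "email": ["", "some@some.com", np.nan, "", "some2@some.com"],
})

n_nan = int(df["email"].isna().sum())    # rows where email is NaN
n_blank = int(df["email"].eq("").sum())  # rows where email is an empty string
print(f"NaN: {n_nan}, empty strings: {n_blank}")

# Normalise a mixture by turning empty strings into NaN,
# so that only one kind of "empty" remains.
normalized = df.replace("", np.nan)
```

If both counts are non-zero, working on the normalised frame means only the NaN case has to be handled afterwards.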
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates. sort_values places NaN last by default, so within each name the row that has an email comes first, and drop_duplicates keeps that first row:
df = df.sort_values('email')\
.drop_duplicates(['fname', 'lname'])
If your empty values are empty strings rather than NaN, you need to specify ascending=False when sorting, because an empty string sorts before any non-empty string:
df = df.sort_values('email', ascending=False)\
.drop_duplicates(['fname', 'lname'])
print(df)
fname lname email
4 Bill Smith some2@some.com
1 Joe Aaron some@some.com
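For reference, here is a self-contained sketch of both variants. The frame is rebuilt from the question's sample data, on the assumption that the blank emails are empty strings. The names by_desc and by_nan are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with blanks assumed to be empty strings.
df = pd.DataFrame({
    "fname": ["Joe", "Joe", "Bill", "Bill", "Bill"],
    "lname": ["Aaron", "Aaron", "Smith", "Smith", "Smith"],
    "email": ["", "some@some.com", "", "", "some2@some.com"],
})

# Variant 1: blanks are empty strings. Sorting in descending order pushes
# them to the end, so the first row kept per name has an email.
by_desc = (
    df.sort_values("email", ascending=False)
      .drop_duplicates(["fname", "lname"])
)

# Variant 2: blanks converted to NaN. An ascending sort already puts
# NaN last (na_position="last" is the default).
by_nan = (
    df.replace("", np.nan)
      .sort_values("email")
      .drop_duplicates(["fname", "lname"])
)

print(by_desc.sort_index())
print(by_nan.sort_index())
```

Both variants keep the rows with index 1 and 4. Note that a name whose rows are all blank still keeps one (blank) row with this approach.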
You can use first with groupby (note that empty strings are replaced with np.nan, since first returns the first non-null value for each column):
df.replace('',np.nan).groupby(['fname','lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2@some.com
1 Joe Aaron some@some.com
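A runnable version of the groupby approach might look like this, again assuming the blanks in the question's sample data are empty strings:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with blanks assumed to be empty strings.
df = pd.DataFrame({
    "fname": ["Joe", "Joe", "Bill", "Bill", "Bill"],
    "lname": ["Aaron", "Aaron", "Smith", "Smith", "Smith"],
    "email": ["", "some@some.com", "", "", "some2@some.com"],
})

result = (
    df.replace("", np.nan)           # make blanks count as null
      .groupby(["fname", "lname"])   # one group per person
      .first()                       # first non-null value in each column
      .reset_index()                 # turn the group keys back into columns
)
print(result)
```

Two things to keep in mind: the original row index is lost, and because first works column by column, a frame with several data columns can end up combining values from different rows of the same group.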