
Dropping selected rows in Pandas with duplicated columns

Suppose I have a dataframe like this:

fname    lname     email

Joe      Aaron   
Joe      Aaron     some@some.com
Bill     Smith 
Bill     Smith
Bill     Smith     some2@some.com

Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?

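You should first check whether your "empty" emails are NaN or empty strings, because the two sort differently. If they are a mixture, you will need to adapt the logic below, for example by converting them all to one representation first.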
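A quick way to run that check, as a minimal sketch: the sample frame below is hypothetical and deliberately mixes NaN and empty strings, which the question's data may not do. It counts each kind of blank and then converts the empty strings to NaN so only one case is left to handle.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame that mixes both kinds of "blank" email
df = pd.DataFrame({
    "fname": ["Joe", "Joe", "Bill", "Bill", "Bill"],
    "lname": ["Aaron", "Aaron", "Smith", "Smith", "Smith"],
    "email": [np.nan, "some@some.com", "", np.nan, "some2@some.com"],
})

# Count each representation of "blank" separately
n_nan = int(df["email"].isna().sum())
n_empty = int(df["email"].eq("").sum())
print(f"NaN: {n_nan}, empty strings: {n_empty}")

# One way to remove the mixture: turn empty strings into NaN
normalized = df.assign(email=df["email"].replace("", np.nan))
```

After this step every blank in normalized is NaN, so the NaN variant of the logic applies.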

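If empty values are NaN

Use pd.DataFrame.sort_values followed by pd.DataFrame.drop_duplicates. Sorting places NaN last by default, and drop_duplicates keeps the first row of each {fname, lname} group, so a row with an email wins whenever one exists: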

df = df.sort_values('email')\
       .drop_duplicates(['fname', 'lname'])

If empty values are strings

Empty strings sort before every non-empty string, so you need to specify ascending=False to push them to the end:

df = df.sort_values('email', ascending=False)\
       .drop_duplicates(['fname', 'lname'])

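Result (shown for the NaN case; with empty strings and ascending=False the same two rows come back in the opposite order)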

print(df)

  fname  lname           email
4  Bill  Smith  some2@some.com
1   Joe  Aaron   some@some.com
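To reproduce both variants end to end, here is a self-contained sketch. The build helper is hypothetical and only exists to create the question's frame with either kind of blank.

```python
import numpy as np
import pandas as pd

def build(blank):
    """Hypothetical helper: the question's frame, with `blank` as the missing email."""
    return pd.DataFrame({
        "fname": ["Joe", "Joe", "Bill", "Bill", "Bill"],
        "lname": ["Aaron", "Aaron", "Smith", "Smith", "Smith"],
        "email": [blank, "some@some.com", blank, blank, "some2@some.com"],
    })

# Blanks are NaN: the default ascending sort already puts NaN last
nan_result = (build(np.nan)
              .sort_values("email")
              .drop_duplicates(["fname", "lname"]))

# Blanks are empty strings: sort descending so "" ends up last
str_result = (build("")
              .sort_values("email", ascending=False)
              .drop_duplicates(["fname", "lname"]))

print(nan_result)
print(str_result)
```

Both calls keep exactly one row per {fname, lname} pair. Note that if a pair had several rows with different emails, only one of them would survive.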

Alternatively, you can use first with groupby. Replace empty strings with np.nan beforehand, because first returns the first non-null value of each column within a group:

df.replace('',np.nan).groupby(['fname','lname']).first().reset_index()
Out[20]: 
  fname  lname           email
0  Bill  Smith  some2@some.com
1   Joe  Aaron   some@some.com
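One consequence of "first non-null value for each column" is worth seeing on a wider frame: the values in one output row can come from different input rows. A small sketch, where the phone column is a hypothetical addition that is not in the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each Joe Aaron row is missing a different field
df = pd.DataFrame({
    "fname": ["Joe", "Joe"],
    "lname": ["Aaron", "Aaron"],
    "email": ["", "some@some.com"],
    "phone": ["555-0100", ""],  # assumed extra column, for illustration only
})

out = (df.replace("", np.nan)
         .groupby(["fname", "lname"])
         .first()
         .reset_index())
print(out)
```

Here the single output row takes email from the second input row and phone from the first. That may be exactly what you want when merging partial records, but it is not the same as selecting one existing row, which is what the sort_values and drop_duplicates approach does.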


 