简体   繁体   中英

Drop duplicated rows based on multiple columns if other column(s) is NaNs in Pandas

Given a test dataset as follows:

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   5   sh       hp      12.0    NaN
5   6   sh       hp       NaN    NaN
6   7   sh       pd       NaN    NaN

I would like to drop duplicated rows based on city and district , then drop rows if its quantity is NaN , but if city and district are not duplicated, then even if quantity is NaN , it's not necessary to drop rows.

Code based on link from here :

m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()

df1 = df[m1 & m2]
print(df1)

Out:

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN

But I want to keep the last row since it's not duplicated with any rows. How could I do that?

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN

You can use:

m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()
m3 = ~df[['city', 'district']].duplicated(keep=False)

df1 = df[(m1 & m2) | (~m1 & m3)]
print(df1)
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

According to your logic, it seems like you want to drop columns where:

  1. quantity is NaN => m1 = df['quantity'].isna() AND
  2. 'city', 'district' is duplicated => m2 = df[['city', 'district']].duplicated(keep=False)

And since you would like to take all columns except which meet the above conditions:

>>> m1 = df['quantity'].isna()
>>> m2 = df[['city', 'district']].duplicated(keep=False)
>>> df[~(m1 & m2)]

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

And your original code would work with keep=False and | (or) operator.

>>> m1 = df['quantity'].notna()
>>> m2 = ~df[['city', 'district']].duplicated(keep=False)
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

EDIT

Based on your comments, if the df is:

>>> df
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# First drop duplicates with NaN items, with any of the above methods
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# then drop duplicates with default condition:
>>> df[m1 | m2].drop_duplicates(['city', 'district'])
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
6   7   sh       pd       NaN    NaN

You can change keep parameter of drop_duplicates to control default behavior, ie whether to keep the first duplicate or the last .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM