Given a test dataset as follows:
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
2 3 bj hd NaN NaN
3 4 bj cy NaN NaN
4 5 sh hp 12.0 NaN
5 6 sh hp NaN NaN
6 7 sh pd NaN NaN
I would like to drop duplicated rows based on city
and district
, then drop rows if its quantity
is NaN
, but if city
and district
are not duplicated, then even if quantity
is NaN
, it's not necessary to drop rows.
Code based on link from here :
m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()
df1 = df[m1 & m2]
print(df1)
Out:
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
4 5 sh hp 12.0 NaN
But I want to keep the last row since it's not duplicated with any rows. How could I do that?
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
2 5 sh hp 12.0 NaN
3 7 sh pd NaN NaN
You can use:
m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()
m3 = ~df[['city', 'district']].duplicated(keep=False)
df1 = df[(m1 & m2) | (~m1 & m3)]
print(df1)
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
4 5 sh hp 12.0 NaN
6 7 sh pd NaN NaN
According to your logic, it seems like you want to drop columns where:
quantity
is NaN
=> m1 = df['quantity'].isna()
AND 'city', 'district'
is duplicated
=> m2 = df[['city', 'district']].duplicated(keep=False)
And since you would like to take all columns except which meet the above conditions:
>>> m1 = df['quantity'].isna()
>>> m2 = df[['city', 'district']].duplicated(keep=False)
>>> df[~(m1 & m2)]
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
4 5 sh hp 12.0 NaN
6 7 sh pd NaN NaN
And your original code would work with keep=False
and |
(or) operator.
>>> m1 = df['quantity'].notna()
>>> m2 = ~df[['city', 'district']].duplicated(keep=False)
>>> df[m1 | m2]
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
4 5 sh hp 12.0 NaN
6 7 sh pd NaN NaN
EDIT
Based on your comments, if the df is:
>>> df
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
2 3 bj hd NaN NaN
3 4 bj cy NaN NaN
4 8 sh hp 14.0 15.0
5 8 sh hp 14.0 16.0
6 7 sh pd NaN NaN
# First drop duplicates with NaN items, with any of the above methods
>>> df[m1 | m2]
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
4 8 sh hp 14.0 15.0
5 8 sh hp 14.0 16.0
6 7 sh pd NaN NaN
# then drop duplicates with default condition:
>>> df[m1 | m2].drop_duplicates(['city', 'district'])
id city district quantity price
0 1 bj hd 12.0 23.0
1 2 bj cy 23.0 45.0
4 8 sh hp 14.0 15.0
6 7 sh pd NaN NaN
You can change keep
parameter of drop_duplicates
to control default behavior, ie whether to keep the first
duplicate or the last
.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.