For a large dataset (>800,000 records), I need to find duplicates across multiple columns, but delete a row only if it also has None in a separate column.
For example, here we search for duplicates with subset=['Col2', 'Col3', 'Col4'] and check for None in Col1:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| None | a | 2 | 1 |
| i1 | a | 2 | 1 |
| i2 | v | 7 | 5 |
| i3 | b | 1 | 3 |
| None | c | 2 | 2 |
| i4 | b | 1 | 3 |
+------+------+------+------+
We should remove only the first row.
Use Series.notna chained by | (bitwise OR) with the inverted mask from DataFrame.duplicated over the subset columns, passing keep=False so that every member of a duplicate group is marked:
df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print(df)
Col1 Col2 Col3 Col4
1 i1 a 2 1
2 i2 v 7 5
3 i3 b 1 3
4 None c 2 2
5 i4 b 1 3
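For reference, the whole example can be reproduced end to end. A minimal runnable sketch that rebuilds the question's table and applies the mask above:

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    'Col1': [None, 'i1', 'i2', 'i3', None, 'i4'],
    'Col2': ['a', 'a', 'v', 'b', 'c', 'b'],
    'Col3': [2, 2, 7, 1, 2, 1],
    'Col4': [1, 1, 5, 3, 2, 3],
})

# Keep a row if Col1 is not missing, OR if the row is not part of any
# duplicate group over Col2/Col3/Col4. keep=False makes duplicated()
# flag every member of a duplicate group, not just the later ones.
mask = df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)
out = df[mask]
print(out)
```

Only row 0 is dropped: it is a duplicate of row 1 over Col2/Col3/Col4 and has None in Col1. Row 4 also has None in Col1 but is not part of any duplicate group, so it survives.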