
Pandas: drop duplicates across multiple columns only if None in another column

For a large dataset (>800,000 records), I need to find duplicates across multiple columns, but delete a row only if it has None in a separate column.

For example, in this case we are searching for duplicates with subset=['Col2', 'Col3', 'Col4'] and checking for None in Col1:


+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| None | a    |    2 |    1 |
| i1   | a    |    2 |    1 |
| i2   | v    |    7 |    5 |
| i3   | b    |    1 |    3 |
| None | c    |    2 |    2 |
| i4   | b    |    1 |    3 |
+------+------+------+------+

We should remove only the first row.
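For reference, a minimal sketch that builds this sample data as a DataFrame (assuming the None entries in Col1 are real missing values rather than the string 'None'):

import pandas as pd

df = pd.DataFrame({
    'Col1': [None, 'i1', 'i2', 'i3', None, 'i4'],
    'Col2': ['a', 'a', 'v', 'b', 'c', 'b'],
    'Col3': [2, 2, 7, 1, 2, 1],
    'Col4': [1, 1, 5, 3, 2, 3],
})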

Chain Series.notna by | (bitwise OR) with the inverted mask from DataFrame.duplicated, using keep=False to mark all duplicates on the subset columns:

# keep rows with a value in Col1, or rows that are not duplicated on Col2-Col4
df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print (df)
   Col1 Col2  Col3  Col4
1    i1    a     2     1
2    i2    v     7     5
3    i3    b     1     3
4  None    c     2     2
5    i4    b     1     3
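To make the logic explicit, the same filter can be split into its two masks (a sketch run on the unfiltered sample data; the names m_dup and m_val are only for illustration):

# True for every row that belongs to a duplicate group on Col2/Col3/Col4
m_dup = df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)
# rows 0, 1, 3, 5 are True; rows 2 and 4 are False

# True for rows that have a value in Col1
m_val = df['Col1'].notna()
# False only for rows 0 and 4, where Col1 is None

# a row is kept if Col1 is filled or it is not part of a duplicate group;
# only row 0 fails both conditions, so only it is removed
df = df[m_val | ~m_dup]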
