For a large dataset (>800,000 records), I need to find duplicates across multiple columns, but delete a row only if it also has None in a separate column.
For example, here we search for duplicates with subset=['Col2', 'Col3', 'Col4'] and check for None in Col1:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| None | a | 2 | 1 |
| i1 | a | 2 | 1 |
| i2 | v | 7 | 5 |
| i3 | b | 1 | 3 |
| None | c | 2 | 2 |
| i4 | b | 1 | 3 |
+------+------+------+------+
We should remove only the first row.
Use Series.notna chained by | (bitwise OR) with the inverted mask from DataFrame.duplicated over the subset columns, passing keep=False so that every member of a duplicate group is marked:
df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)]
print(df)
Col1 Col2 Col3 Col4
1 i1 a 2 1
2 i2 v 7 5
3 i3 b 1 3
4 None c 2 2
5 i4 b 1 3
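For reference, the whole example can be reproduced end to end. A minimal runnable sketch that rebuilds the question's table and applies the mask above:

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    'Col1': [None, 'i1', 'i2', 'i3', None, 'i4'],
    'Col2': ['a', 'a', 'v', 'b', 'c', 'b'],
    'Col3': [2, 2, 7, 1, 2, 1],
    'Col4': [1, 1, 5, 3, 2, 3],
})

# Keep a row if Col1 is not missing, OR if the row is not part of any
# duplicate group over Col2/Col3/Col4. keep=False makes duplicated()
# flag every member of a duplicate group, not just the later ones.
mask = df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False)
out = df[mask]
print(out)
```

Only row 0 is dropped: it is a duplicate of row 1 over Col2/Col3/Col4 and has None in Col1. Row 4 also has None in Col1 but is not part of any duplicate group, so it survives.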