简体   繁体   中英

Python Dataframe Duplicates Removal based on combination of two columns

I have a pandas dataframe that contains duplicates, but not regular duplicates you can remove by using simple df.drop_duplicates . For example when combining columns 1 and 2, for the purpose of my work, AB and BA are the same, and the 5th row should be eliminated.

1 2
---
A B ===> AB
C D
E F
G H
B A ===> BA
I J
K L

I want to remove duplicates keeping only one of the first rows. This would lead to:

1 2
---
A B
C D
E F
G H
I J
K L

cannot figure out how to do that. any help would be appreciated.

Update: (adding another column with numbers)

1 2  3
-------
A B 0.2 ===> AB
C D 0.1
E F 0.0
G H 0.5
B A 0.2 ===> BA
I J 0.3
K L 0.6

One solution could be:

import pandas as pd

data = {'1': {0: 'A', 1: 'C', 2: 'E', 3: 'G', 4: 'B', 5: 'I', 6: 'K'}, 
        '2': {0: 'B', 1: 'D', 2: 'F', 3: 'H', 4: 'A', 5: 'J', 6: 'L'},
        '3': {0: 0.2, 1: 0.1, 2: 0.0, 3: 0.5, 4: 0.2, 5: 0.3, 6: 0.6}
        }

df = pd.DataFrame(data)

out = df.loc[~df[['1','2']].apply(sorted, axis=1).duplicated()]
print(out)

   1  2    3
0  A  B  0.2
1  C  D  0.1
2  E  F  0.0
3  G  H  0.5
5  I  J  0.3
6  K  L  0.6

Above solution assumes that you want to get rid of "duplicates" based on column 1 and 2 . However, if you want to look for "duplicates" on the entire df including col 3 , you'll need to convert all values to the same dtype (ie strings) first. So, in that case, you might do:

out = df.loc[~df.astype(str).apply(sorted, axis=1).duplicated()]
print(out)

   1  2    3
0  A  B  0.2
1  C  D  0.1
2  E  F  0.0
3  G  H  0.5
5  I  J  0.3
6  K  L  0.6

NB This will not affect the dtype of col 3 in the actual result of course; it still contains float values.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM