Remove duplicate values in a comma-separated column in Pandas

I want to remove duplicate entries within each cell of a column using Pandas. I tried df.drop_duplicates(), but it did not work. How can I achieve this in Pandas?

Input:

A
team=red, Manager=Travis
team=Blue, Manager=John, team=Blue
Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC

Expected output:

A
team=red, Manager=Travis
team=Blue, Manager=John
Manager=David, Bank=HDFC, team=XYZ

Code

df = df.drop_duplicates('A', keep='last')

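df.drop_duplicates() removes duplicate rows, not duplicate entries inside a single cell. To deduplicate within each cell, you can use ordinary Python data structures: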

  1. split entries
  2. convert to set (or some non duplicated structure)
  3. join back to string
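
As a rough illustration, the three steps can be traced on a single cell in plain Python before applying them to the whole column. The variable names here are only for illustration and are not part of the original answer:

```python
# One cell taken from the sample input in the question.
cell = "team=Blue, Manager=John, team=Blue"

parts = cell.split(", ")      # 1. split entries into a list
unique = set(parts)           # 2. convert to a set, which drops repeated entries
result = ", ".join(unique)    # 3. join back to a single string

# A set is unordered, so the two entries may come out in either order.
print(result)
```

The same steps applied to the whole column with pandas:
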
print(df['A'])
0                         team=red, Manager=Travis
1               team=Blue, Manager=John, team=Blue
2    Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Name: A, dtype: object


out = (
    df['A'].str.split(r',\s+')
    .map(set)
    .str.join(", ")
)

print(out)
0              Manager=Travis, team=red
1               team=Blue, Manager=John
2    Bank=HDFC, team=XYZ, Manager=David
Name: A, dtype: object

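A set does not preserve order, which is why the entries in the output above are shuffled. If the order of the entries matters, use dict.fromkeys instead of set: dictionaries preserve insertion order (guaranteed since Python 3.7, and an implementation detail of CPython 3.6).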

out = (
    df['A'].str.split(r',\s+')
    .map(dict.fromkeys)
    .str.join(", ")
)

print(out)
0              team=red, Manager=Travis
1               team=Blue, Manager=John
2    Manager=David, Bank=HDFC, team=XYZ
Name: A, dtype: object
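
For reference, here is a self-contained sketch of the order-preserving approach that builds the sample frame from the question. The helper name dedupe_cell is hypothetical, and leaving non-string cells (such as NaN) untouched is an assumption rather than something the question asks for:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [
        "team=red, Manager=Travis",
        "team=Blue, Manager=John, team=Blue",
        "Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC",
    ]
})

def dedupe_cell(cell):
    """Drop repeated comma-separated entries, keeping the first occurrence."""
    # Assumption: non-string cells (e.g. NaN) are returned unchanged.
    if not isinstance(cell, str):
        return cell
    parts = [part.strip() for part in cell.split(",")]
    # dict.fromkeys keeps one key per distinct entry, in insertion order.
    return ", ".join(dict.fromkeys(parts))

df["A"] = df["A"].map(dedupe_cell)
print(df["A"])
```

Using a plain function with map makes it easy to decide explicitly what happens to missing values, which the chained .str version does not spell out.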

Try:

df['A'].str.split(',').explode().str.strip()\
       .groupby(level=0).agg(lambda s: ','.join(s.drop_duplicates()))

Output:

0             team=red,Manager=Travis
1              team=Blue,Manager=John
2    Manager=David,Bank=HDFC,team=XYZ
Name: A, dtype: object
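
One thing to watch with the explode approach: Series.drop_duplicates() compares values only and ignores the index, so calling it on the exploded Series also removes an entry that repeats in a different row. The sketch below uses made-up data (the second row is hypothetical, not from the question) to contrast that with deduplicating inside each row:

```python
import pandas as pd

# Hypothetical data: "team=Blue" appears in both rows.
s = pd.Series([
    "team=Blue, Manager=John, team=Blue",
    "team=Blue, Manager=Ann",
])

# One entry per line; the index still records the original row.
exploded = s.str.split(",").explode().str.strip()

# Dedupe across the whole column: row 1 loses "team=Blue"
# because the same value already appeared in row 0.
global_dedupe = exploded.drop_duplicates().groupby(level=0).agg(",".join)

# Dedupe inside each original row only.
per_row = exploded.groupby(level=0).agg(lambda g: ",".join(g.drop_duplicates()))

print(global_dedupe.tolist())  # ['team=Blue,Manager=John', 'Manager=Ann']
print(per_row.tolist())        # ['team=Blue,Manager=John', 'team=Blue,Manager=Ann']
```

With the sample input from the question no entry repeats across rows, so both variants give the same result there.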
