
Keep first occurrence while removing duplicates in pandas

I need to remove duplicate words, ignoring case, while keeping the first occurrence and preserving the word order of the sentence. This needs to be done on each row of a column.

Initial format:                                        How the output should look:
col_sentence                                                 col_sentence
paper Plastic aluminum paper                                 paper Plastic aluminum 
paper Plastic aluminum Paper                                 paper Plastic aluminum 
Paper tin glass tin PAPER                                    Paper tin glass 
Paper tin glass Paper-tin                                    Paper tin glass

Is it possible to do this with Python? I've written a function that removes duplicates, but it only works by converting to lowercase and changing the order, which is not feasible in my case.

string = "paper Plastic aluminum Paper"
set_string = list()
for s in string.split(' '):
    if s not in set_string:
        set_string.append(s)
    
string = ' '.join(set_string)
print(string)
#output paper Plastic aluminum Paper

Here is a sample Python program that keeps the first occurrence of each word and removes the rest. You can turn it into a function and apply it to every row/column.

Note: Requires Python 3.7+, where dicts are guaranteed to preserve insertion order.

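import re

def unique_only(sentence):
    # Split on runs of non-word characters (spaces, hyphens, punctuation).
    words = re.split(r'\W+', sentence)
    unique_words = {}  # lower-cased word -> first spelling seen
    for word in words:
        key = word.lower()
        if key not in unique_words:
            unique_words[key] = word
    # dicts keep insertion order, so first occurrences come out in sentence order.
    return ' '.join(unique_words.values())

# applymap calls the function on every cell; pandas 2.1+ deprecates it in favor of DataFrame.map.
df.applymap(unique_only)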

Example Input:

                   col_sentence
0  paper Plastic aluminum paper
1  paper Plastic aluminum Paper
2     Paper tin glass tin PAPER
3     Paper tin glass Paper-tin

Output:

             col_sentence
0  paper Plastic aluminum
1  paper Plastic aluminum
2         Paper tin glass
3         Paper tin glass
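
One edge case worth noting: splitting on non-word characters leaves an empty string in the list when a sentence starts or ends with punctuation, which can add a stray space to the result. A minimal sketch that sidesteps this by extracting the words with re.findall instead (the name dedupe_words is hypothetical and not part of the answer above):

```python
import re

def dedupe_words(sentence):
    """Keep the first occurrence of each word, comparing case-insensitively."""
    first_seen = {}  # lower-cased word -> spelling of its first occurrence
    # findall returns only the word tokens, so no empty strings appear
    for word in re.findall(r"\w+", sentence):
        key = word.lower()
        if key not in first_seen:
            first_seen[key] = word
    return " ".join(first_seen.values())

print(dedupe_words("Paper tin glass Paper-tin"))  # Paper tin glass
print(dedupe_words("paper, Plastic paper!"))      # paper Plastic
```

Because it takes and returns a plain string, it should work with df["col_sentence"].apply(dedupe_words) in the same way as the function above.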

Assuming only "-" and " " are the word separators in your columns, try this:

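def uniqueList(row):
    words = row.split(" ")
    unique = words[0]
    for w in words:
        # Caution: `in` on a string is a substring test, so "pap" would match "paper".
        if w.lower() not in unique.lower():
            unique = unique + " " + w
    return unique

# Turn "-" into a space first, then de-duplicate each row.
data["col_sentence"].str.replace("-", " ").apply(uniqueList)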

Edit (incorporating @im0j's suggestion): To avoid partial matching of strings (for example, pap matching inside paper), change the function to the following:

def uniqueList_full(row):
    words = row.split(" ")
    unique = [words[0]]
    for w in words:
        if w.lower() not in [u.lower() for u in unique]:
            unique = unique + [w]
    return " ".join(unique)
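
The list comprehension above rebuilds the lower-cased list for every word. A possible variant, sketched under the same assumption that words are separated only by spaces or hyphens, tracks the lower-cased words in a set instead (unique_words_seen is a hypothetical name):

```python
def unique_words_seen(row):
    """Drop repeated words (ignoring case), keeping first occurrences in order."""
    seen = set()  # lower-cased words already kept
    kept = []     # words to keep, in their original order
    # Treat "-" as a separator, then split on any run of whitespace
    for w in row.replace("-", " ").split():
        key = w.lower()
        if key not in seen:
            seen.add(key)
            kept.append(w)
    return " ".join(kept)

print(unique_words_seen("Paper tin glass tin PAPER"))  # Paper tin glass
print(unique_words_seen("pap paper Pap"))              # pap paper
```

With pandas this could be applied as data["col_sentence"].apply(unique_words_seen); the hyphen replacement is done inside the function, so the separate str.replace step is not needed.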

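Another option is to use OrderedDict. Note that this comparison is case-sensitive, so Paper and PAPER are kept alongside paper, as the output below shows: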

import pandas as pd
from collections import OrderedDict

df = pd.DataFrame(data = {'col_sentence':['paper Plastic aluminum paper','paper Plastic aluminum Paper','Paper tin glass tin PAPER','Paper tin glass Paper-tin']})
df.col_sentence.apply(lambda x: ' '.join(list(OrderedDict.fromkeys(x.replace('-', ' ').split()))))
0          paper Plastic aluminum
1    paper Plastic aluminum Paper
2           Paper tin glass PAPER
3                 Paper tin glass
Name: col_sentence, dtype: object
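
If the case-insensitive result from the question is wanted, the same ordered-dictionary idea can be keyed on the lower-cased word. A rough sketch, assuming spaces and hyphens are the only separators (first_by_lowercase is a hypothetical helper; a plain dict is used because it preserves insertion order on Python 3.7+):

```python
def first_by_lowercase(sentence):
    """Keep the first spelling of each word, comparing words case-insensitively."""
    first = {}
    for word in sentence.replace("-", " ").split():
        # setdefault stores the word only if its lower-cased key is new
        first.setdefault(word.lower(), word)
    return " ".join(first.values())

rows = [
    "paper Plastic aluminum paper",
    "paper Plastic aluminum Paper",
    "Paper tin glass tin PAPER",
    "Paper tin glass Paper-tin",
]
results = [first_by_lowercase(row) for row in rows]
print(results)
```

It could then be used as df.col_sentence.apply(first_by_lowercase).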


 