I need to remove case-insensitive duplicates from a sentence, keeping the first occurrence and preserving the word order. This needs to be done on each row of a column.
Initial format:

col_sentence
paper Plastic aluminum paper
paper Plastic aluminum Paper
Paper tin glass tin PAPER
Paper tin glass Paper-tin

How the output should look:

col_sentence
paper Plastic aluminum
paper Plastic aluminum
Paper tin glass
Paper tin glass
Can this be done in Python? I've written a function that removes duplicates, but only by converting everything to lowercase and changing the order, which is not acceptable in my case.
string = "paper Plastic aluminum Paper"
set_string = list()
for s in string.split(' '):
    if s not in set_string:
        set_string.append(s)
string = ' '.join(set_string)
print(string)
# output: paper Plastic aluminum Paper
Here is a sample Python function that keeps the first occurrence of each word and removes the later ones, ignoring case. You can apply it to every cell of the DataFrame.
Note: Requires Python 3.7+, where dicts are guaranteed to preserve insertion order.
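import re

def unique_only(sentence):
    # Split on runs of non-word characters (spaces, hyphens, punctuation)
    words = re.split(r'\W+', sentence)
    unique_words = {}
    for word in words:
        key = word.lower()
        # Store only the first spelling seen for each lowercased word
        if key not in unique_words:
            unique_words[key] = word
    words = unique_words.values()
    return ' '.join(words)

# Apply to every cell (in pandas 2.1+, DataFrame.map replaces the deprecated applymap)
df.applymap(unique_only)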
Example Input:
col_sentence
0 paper Plastic aluminum paper
1 paper Plastic aluminum Paper
2 Paper tin glass tin PAPER
3 Paper tin glass Paper-tin
Output:
col_sentence
0 paper Plastic aluminum
1 paper Plastic aluminum
2 Paper tin glass
3 Paper tin glass
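One thing to watch with re.split is that a sentence ending in punctuation yields a trailing empty string, which then ends up in the joined result. The sketch below is a variation, not the answer's exact code: it swaps in re.findall to collect the words directly, and the name unique_only_findall is made up for illustration.

```python
import re

def unique_only_findall(sentence):
    # Hypothetical variant of unique_only: re.findall returns only the
    # word runs, so no empty tokens appear at the edges of the sentence.
    unique_words = {}
    for word in re.findall(r"\w+", sentence):
        key = word.lower()
        if key not in unique_words:
            unique_words[key] = word  # remember the first spelling seen
    return " ".join(unique_words.values())

# re.split leaves an empty string after the final "."
print(re.split(r"\W+", "paper, Plastic paper."))
print(unique_only_findall("paper, Plastic paper."))
```

It should be usable the same way as the original, for example df["col_sentence"].apply(unique_only_findall) for a single column.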
Assuming only "-" and " " are the word separators in your columns, try this:
def uniqueList(row):
    words = row.split(" ")
    unique = words[0]
    for w in words:
        if w.lower() not in unique.lower():
            unique = unique + " " + w
    return unique

data["col_sentence"].str.replace("-", " ").apply(uniqueList)
Edit (incorporating @im0j's suggestion): To avoid partial matching of strings (for example, matching pap with paper), change the function to the following:
def uniqueList_full(row):
    words = row.split(" ")
    unique = [words[0]]
    for w in words:
        if w.lower() not in [u.lower() for u in unique]:
            unique = unique + [w]
    return " ".join(unique)
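For long sentences, rebuilding the lowercased list on every iteration gets slow. A possible refinement, sketched here under the same assumption that only "-" and " " separate words, tracks the lowercased forms in a set. The name unique_list_seen is hypothetical.

```python
def unique_list_seen(row):
    seen = set()   # lowercased words already kept
    kept = []      # original spellings, in first-seen order
    for w in row.replace("-", " ").split():
        key = w.lower()
        if key not in seen:
            seen.add(key)
            kept.append(w)
    return " ".join(kept)

print(unique_list_seen("Paper tin glass tin PAPER"))
print(unique_list_seen("pap paper pap"))
```

Because whole words are compared, pap and paper stay distinct. With pandas it could be applied as data["col_sentence"].apply(unique_list_seen).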
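Another option is to use OrderedDict. Note that this comparison is case-sensitive, so differently cased variants such as paper and Paper are both kept: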
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame(data = {'col_sentence':['paper Plastic aluminum paper','paper Plastic aluminum Paper','Paper tin glass tin PAPER','Paper tin glass Paper-tin']})
df.col_sentence.apply(lambda x: ' '.join(list(OrderedDict.fromkeys(x.replace('-', ' ').split()))))
0 paper Plastic aluminum
1 paper Plastic aluminum Paper
2 Paper tin glass PAPER
3 Paper tin glass
Name: col_sentence, dtype: object
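To make this approach ignore case while still keeping the first spelling, one option is to key a dict on the lowercased word. This is a sketch rather than part of the answer above: the name dedupe_ignore_case is made up, and it relies on plain dicts preserving insertion order (Python 3.7+).

```python
def dedupe_ignore_case(sentence):
    first_seen = {}
    for word in sentence.replace("-", " ").split():
        # setdefault stores the word only if its lowercased key is new
        first_seen.setdefault(word.lower(), word)
    return " ".join(first_seen.values())

rows = [
    "paper Plastic aluminum paper",
    "paper Plastic aluminum Paper",
    "Paper tin glass tin PAPER",
    "Paper tin glass Paper-tin",
]
for row in rows:
    print(dedupe_ignore_case(row))
```

With the DataFrame above, this could replace the lambda: df.col_sentence.apply(dedupe_ignore_case).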