Remove duplicates with pandas while preserving the order [python]
I have a column in my df from which I need to remove case sensitive duplicates, keeping the first occurrence. The problem is that some rows may have words separated by ',' or containing '-' between them. Is there a way to clean this data while preserving the order at the same time?
This is what my data looks like:
3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Starts, Multicor
This is how it should look:
3sprouts Cesto de Roupa Cisne, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor
Many thanks in advance.
Assumption: words containing '-' will not be removed.
Some ideas:
Case sensitive duplicates: IMO this should be case insensitive, so compare with .lower().
Words separated by ',' or containing '-' between them: if '-' exists, split the word on it, then strip ',' for comparison.
import re
import itertools
sentences = [
'3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'
]
for s in sentences:
    s_split = s.split(' ')  # keep original sentence split by ' '
    s_split_without_comma = [i.strip(',') for i in s_split]
    # get compare words split by '-' and ' ', use re or itertools
    # method 1: re
    compare_words = re.split(' |-', s)
    # method 2: itertools
    compare_words = list(itertools.chain.from_iterable([i.split('-') for i in s_split]))
    # method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')
    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]
    # start to compare
    need_removed_index = []
    for word in compare_words_without_comma:
        matched_indexes = []
        for idx, w in enumerate(s_split_without_comma):
            if word.lower() in w.lower().split('-'):
                matched_indexes.append(idx)
        if len(matched_indexes) > 1:  # has duplicates
            need_removed_index += matched_indexes[1:]
    need_removed_index = list(set(need_removed_index))
    # keep the remaining words and join with ' '
    print(" ".join([i for idx, i in enumerate(s_split) if idx not in need_removed_index]))
Should print:
3sprouts Cesto de Roupa Cisne Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor
This differs slightly from the expected output in the question: I still can't figure out why Sprouts is also removed in row 1 (does '3sprouts' match 'sprouts'?).
Never mind, this is just meant to give some ideas.
FYI.
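The same idea can be sketched as a single pass with a set of already-seen word parts. This is only a sketch: the function name drop_repeated_words is hypothetical, and it is meant to reproduce the loop above on the sample sentences, not every edge case.

```python
def drop_repeated_words(text):
    """Drop space-separated tokens whose parts were already seen (case-insensitive)."""
    seen = set()  # lower-cased word parts seen so far
    kept = []
    for token in text.split(' '):
        # strip commas and split hyphenated tokens for the comparison only
        parts = token.strip(',').lower().split('-')
        if not any(part in seen for part in parts):
            kept.append(token)  # keep the token exactly as written
        seen.update(parts)
    return ' '.join(kept)


sentences = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]
for s in sentences:
    print(drop_repeated_words(s))
```

With a DataFrame, this could presumably be applied row by row with df['col'].apply(drop_repeated_words).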
# sample dataframe used by me for testing:
import pandas as pd
df = pd.DataFrame({'col': {0: '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
                           1: 'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
                           2: 'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'}})
Try:
# title-cased copy, used only for checking
out = df['col'].str.title().str.split(', ', expand=True)
# original-case copy, used for assigning
real = df['col'].str.split(', ', expand=True)
# check whether col 0 of out contains any value from col 1 of out, and pass that mask to real
real[1] = real[1].mask(out[0].str.contains(f'({"|".join(out[1])})'))
# check whether col 0 of out contains any value from col 2 of out, and pass that mask to real
real[2] = real[2].mask(out[0].str.contains(f'({"|".join(out[2])})'))
# finally, join the remaining values with ', '
df['col'] = real.apply(lambda x: ', '.join(x.dropna()), axis=1)
Output of df:
col
0 3sprouts Cesto de Roupa Cisne Sprouts, Organizador
1 Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
2 Bright-Starts Mordedor Twist & Teethe, Multicor
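Note that the pattern above is built from the values of every row and is not regex-escaped. A row-by-row variant in plain Python might look like the sketch below. The function name drop_repeated_segments and the rule "drop a segment when all of its words already appeared earlier in the same row" are assumptions made for illustration, not part of the answer above.

```python
import re


def drop_repeated_segments(text, sep=', '):
    """Drop sep-separated segments whose words all occur in earlier segments of the same text."""
    seen = set()  # lower-cased words from the segments kept so far
    kept = []
    for segment in text.split(sep):
        # compare on lower-cased words, splitting on whitespace and '-'
        words = {w for w in re.split(r'[-\s]+', segment.lower()) if w}
        if words and words <= seen:
            continue  # every word was already seen, so treat the segment as a duplicate
        kept.append(segment)
        seen |= words
    return sep.join(kept)


rows = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]
for row in rows:
    print(drop_repeated_segments(row))
```

On a DataFrame this could presumably be used as df['col'].apply(drop_repeated_segments), so each row is only compared against itself.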