Remove duplicates with pandas while preserving the order [python]
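I have a column in my df from which I need to remove duplicates, ignoring case and keeping the first occurrence. The problem is that on some rows the words may be separated by "," or contain "-" between them. Is there a way to clean this data while preserving the order?
This is what my data looks like: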
3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Starts, Multicor
This is how it should look:
3sprouts Cesto de Roupa Cisne, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor
Thanks in advance.
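Assumption:
- Words containing "-" will not be removed.
Some ideas:
- Use .lower() for the comparison.
- If "-" is present, split the word on it, then strip "," before comparing.
import re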
import itertools

sentences = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'
]

for s in sentences:
    s_split = s.split(' ')  # keep the original sentence, split on ' '
    s_split_without_comma = [i.strip(',') for i in s_split]

    # build the words to compare, split on both '-' and ' '
    # (three equivalent ways; the last assignment is the one used)
    # method 1: re
    compare_words = re.split(' |-', s)
    # method 2: itertools
    compare_words = list(itertools.chain.from_iterable([i.split('-') for i in s_split]))
    # method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')

    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]

    # compare: collect the indexes of every repeated occurrence
    need_removed_index = []
    for word in compare_words_without_comma:
        matched_indexes = []
        for idx, w in enumerate(s_split_without_comma):
            if word.lower() in w.lower().split('-'):
                matched_indexes.append(idx)
        if len(matched_indexes) > 1:  # has duplicates: keep only the first match
            need_removed_index += matched_indexes[1:]

    need_removed_index = list(set(need_removed_index))

    # keep the remaining tokens and join them with ' '
    print(" ".join([i for idx, i in enumerate(s_split) if idx not in need_removed_index]))
This prints:
3sprouts Cesto de Roupa Cisne Sprouts, Organizador
Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
Bright-Starts Mordedor Twist & Teethe, Multicor
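The same idea can also be sketched as a single pass that remembers which lowercase word parts it has already seen. This is a variant, not the code above: `dedupe_words` is a hypothetical helper name, and it drops a token only when all of its "-"-separated parts were seen earlier, which may or may not be the behaviour you want.

```python
def dedupe_words(sentence):
    """Drop tokens whose '-'-separated parts were all seen earlier (case-insensitive)."""
    seen = set()   # lowercase word parts encountered so far
    kept = []      # original tokens to keep, in order
    for token in sentence.split(' '):
        # Compare on lowercase parts, ignoring commas at the ends of the token.
        parts = [p for p in token.strip(',').lower().split('-') if p]
        if parts and all(p in seen for p in parts):
            continue  # every part already appeared: treat the token as a duplicate
        seen.update(parts)
        kept.append(token)
    return ' '.join(kept)


sentences = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]
for s in sentences:
    print(dedupe_words(s))
```

On the three sample sentences this should give the same lines as the loop above. Assuming the column holds plain strings, it could be applied per row with something like df['col'].map(dedupe_words).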
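The printed output differs slightly from the expected output in the question: I still can't figure out why "Sprouts" is also removed from row 1 there (does "3sprouts" match "sprouts"?).
Never mind, this is just meant to give you some ideas.
FYI.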
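One possible reading of the expected output, and it is only a guess at the asker's intent, is that a word counts as a duplicate when it occurs inside any word that was kept earlier (so "sprouts" matches "3sprouts"). A minimal sketch under that assumption, with `dedupe_by_substring` as a hypothetical helper name:

```python
def dedupe_by_substring(text, sep=', '):
    """Drop words contained (case-insensitively) in an earlier kept word."""
    kept_words = []      # lowercase words kept so far, across all segments
    kept_segments = []   # cleaned comma-separated segments, in order
    for segment in text.split(sep):
        words = []
        for word in segment.split():
            low = word.lower()
            # ASSUMPTION: substring containment defines a duplicate.
            if any(low in earlier for earlier in kept_words):
                continue
            kept_words.append(low)
            words.append(word)
        if words:  # skip segments that became empty after cleaning
            kept_segments.append(' '.join(words))
    return sep.join(kept_segments)


print(dedupe_by_substring('3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador'))
# 3sprouts Cesto de Roupa Cisne, Organizador
```

Substring matching is aggressive: a short word such as "de" would be dropped if it came after "Mordedor", so this is only reasonable if such collisions are acceptable in your data.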
import pandas as pd

# sample dataframe I used for testing:
df = pd.DataFrame({'col': {0: '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
                           1: 'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
                           2: 'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor'}})
Try:
# title-cased copy, split on ', ' (used only for checking)
out = df['col'].str.title().str.split(', ', expand=True)
# original text, split on ', ' (used for assigning)
real = df['col'].str.split(', ', expand=True)
# blank out column 1 of real where column 0 of out contains any of the values in column 1 of out
real[1] = real[1].mask(out[0].str.contains(f'({"|".join(out[1])})'))
# blank out column 2 of real where column 0 of out contains any of the values in column 2 of out
real[2] = real[2].mask(out[0].str.contains(f'({"|".join(out[2])})'))
# finally, join the remaining values with ', '
df['col'] = real.apply(lambda x: ', '.join(x.dropna()), axis=1)
df
Output:
col
0 3sprouts Cesto de Roupa Cisne Sprouts, Organizador
1 Bright-Starts Mordedor Chocalho Rattle & Teethe, Rosa/Roxo
2 Bright-Starts Mordedor Twist & Teethe, Multicor
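Note that the pattern above is built from the values of every row, so as far as I can tell row 1 ("bright Starts") is only cleaned because "Starts" from row 2 ends up in the same pattern. A row-wise sketch that avoids this dependence is shown below. It assumes that "-" can be treated as a space for the comparison; `drop_repeated_segments` is a hypothetical helper name.

```python
def drop_repeated_segments(text, sep=', '):
    """Drop later segments that already occur in the first segment."""
    head, *rest = text.split(sep)
    # Normalise for comparison only: lowercase, and treat '-' as a space.
    norm_head = head.lower().replace('-', ' ')
    kept = [head]
    for segment in rest:
        norm_segment = segment.lower().replace('-', ' ')
        if norm_segment in norm_head:
            continue  # the segment repeats part of the first segment
        kept.append(segment)
    return sep.join(kept)


rows = [
    '3sprouts Cesto de Roupa Cisne Sprouts, 3Sprouts, Organizador',
    'Bright-Starts Mordedor Chocalho Rattle & Teethe, bright Starts, Rosa/Roxo',
    'Bright-Starts Mordedor Twist & Teethe, Starts, Multicor',
]
for row in rows:
    print(drop_repeated_segments(row))
```

Because each row is handled on its own, the result does not change when other rows are added or removed. With the sample dataframe it could be applied with something like df['col'].map(drop_repeated_segments).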