
Remove similar duplicates from a list of strings

I'm trying to remove similar duplicates from my list. Here is my code:

l = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]

res = [*set(l)]
print(res)

This removes only the exact duplicate "shirt", but I also want to remove similar entries such as "shirt len", "pant cotton", and "len pant".

Expected output: shirt, pant, watch

It sounds like you want to check whether any of the single-word strings appears inside another string and, if so, remove that longer string as a duplicate. I would go about it this way:

  • Separate the list into single-word strings and all other strings.
  • For each longer string, check whether any of the single-word strings is contained in it.
    • If so, drop it. Otherwise, add it to the result.
  • Finally, add all the single-word strings to the result.
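l = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]

# Split the input into single-word strings and multi-word strings.
# Using sets also drops exact duplicates.
single, longer = set(), set()
for s in l:
    if len(s.split()) == 1:
        single.add(s)
    else:
        longer.add(s)

# Keep a longer string only if no single-word string occurs in it.
# Note: `word in s` is a substring check.
res = set()
for s in longer:
    if not any(word in s for word in single):
        res.add(s)
res |= single

print(res)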

This example gives (the order of elements in a set may vary):

{'shirt', 'watch', 'pant'}
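
Since a set does not keep the input order, and the substring check would also treat "pants cotton" as containing "pant", here is a minimal sketch of a variant that compares whole words and returns a list in the original order. The function name remove_similar is hypothetical, chosen only for this illustration, and the whole-word rule is an assumption about what "similar" should mean here.

```python
def remove_similar(items):
    """Keep single-word strings (first occurrence only) and any longer
    string that shares no whole word with them, in input order."""
    single = {s for s in items if len(s.split()) == 1}
    seen = set()
    result = []
    for s in items:
        if s in seen:
            continue  # skip exact duplicates
        seen.add(s)
        # Compare whole words, so "pant" does not match "pants cotton".
        if s in single or single.isdisjoint(s.split()):
            result.append(s)
    return result


l = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]
print(remove_similar(l))  # ['shirt', 'pant', 'watch']
```

If substring matching is what you actually want, the set-based version above already does that.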

You can try something like the following:

Select the single-word elements from the list, then apply set:

lst = ["shirt", "shirt", "shirt len", "pant", "pant cotton", "len pant", "watch"]
set(ls for ls in lst if ' ' not in ls)
# Output: {'pant', 'shirt', 'watch'}

Note that if your input is ["shirt", "shirt", "shirt len", "pant cotton", "len pant", "watch"], with no standalone "pant", the output will be {'shirt', 'watch'}.

If you would still like to include pant and cotton, you can try:

set(sum([ls.split(' ') for ls in lst], []))
#output {'cotton', 'len', 'pant', 'shirt', 'watch'}

You can then filter out words according to your own requirements.
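
As a rough sketch of that last filtering step: the set of unwanted words below is purely an assumption for illustration, so substitute whatever condition fits your data.

```python
lst = ["shirt", "shirt", "shirt len", "pant cotton", "len pant", "watch"]

# Hypothetical condition: words to drop (an assumption for illustration).
unwanted = {"len", "cotton"}

# Split every string into words and collect the unique ones.
words = {word for s in lst for word in s.split()}

# Remove the unwanted words; sort only to get a stable order for printing.
filtered = sorted(words - unwanted)
print(filtered)  # ['pant', 'shirt', 'watch']
```

The set comprehension does the same job as set(sum(..., [])) without building intermediate lists.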
