Remove similar (but not the same) strings in a single list

I have a list of strings that looks like this:

my_list = ['https://www.google.com/', 'http://www.google.com/', 
           'https://www.google.com',  'http://www.google.com']

As you can see, they are not identical, but they all look very similar.

I also have this function:

from fuzzywuzzy import fuzz

def similar(a, b):
    return fuzz.ratio(a,b)

I want to use this function to do something like:

for a,b in my_list:
    print (a,b)
    if similar(a,b) > 0.95:
        my_list.remove(b)

So I'm trying to remove similar-looking strings from a list when their similarity ratio is above a certain threshold. I want to end up with just the first URL in the list, so in this case my_list would become:

my_list = ['https://www.google.com/']

After doing some googling, I found that fuzzywuzzy has a built-in function for this, which is pretty great.

from fuzzywuzzy import fuzz
from fuzzywuzzy.process import dedupe

# threshold is on fuzzywuzzy's 0-100 score scale
deduped_list = list(dedupe(my_list, threshold=97, scorer=fuzz.ratio))
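When choosing a threshold, it helps to look at the raw scores first. fuzz.ratio reports similarity as an integer from 0 to 100, not as a fraction between 0 and 1. The sketch below does not call fuzzywuzzy. It uses difflib.SequenceMatcher from the standard library as a stand-in scorer, and the helper name ratio_0_100 is made up for this illustration. It is assumed to approximate fuzz.ratio, so the exact numbers fuzzywuzzy returns may differ.

```python
from difflib import SequenceMatcher

def ratio_0_100(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity as an int from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

my_list = ['https://www.google.com/', 'http://www.google.com/',
           'https://www.google.com', 'http://www.google.com']

# Score every other URL against the first one.
first = my_list[0]
for other in my_list[1:]:
    print(other, ratio_0_100(first, other))
```

With this stand-in the three comparisons score 98, 98 and 95. A test written as "score > 95" would therefore still keep the last URL, so the cutoff has to sit below the lowest score you want treated as a duplicate.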

In general, you should not call list.remove() on a list inside a for loop that iterates over that same list. Removing an item shifts the later items down by one position, so the iterator skips the item that follows each removal.

Because you always want to keep the first item, you can exclude it from the loop:
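A small self-contained demonstration of the problem, using made-up helper names and a toy list rather than the URLs from the question:

```python
def remove_all_in_place(items):
    """Try to empty the list while iterating over it directly."""
    for item in items:
        items.remove(item)   # shifts the remaining items left by one
    return items

def remove_all_via_copy(items):
    """Iterate over a shallow copy so removals cannot disturb the iteration."""
    for item in items[:]:
        items.remove(item)
    return items

print(remove_all_in_place(['a', 'b', 'c', 'd']))  # ['b', 'd']
print(remove_all_via_copy(['a', 'b', 'c', 'd']))  # []
```

The first function leaves 'b' and 'd' behind because each removal moves the next item into the position the iterator has already passed.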

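idx = 1
while idx < len(my_list):
    # fuzz.ratio scores run from 0 to 100, so the cutoff is 95 rather than 0.95
    if similar(my_list[idx - 1], my_list[idx]) > 95:
        del my_list[idx]   # delete by position; the next item slides into idx
    else:
        idx += 1           # advance only when nothing was removed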

print(my_list)

An alternative solution with a list comprehension:

first_item = my_list[0]
# fuzz.ratio scores run from 0 to 100, so the cutoff is 95 rather than 0.95
my_list = [first_item] + [item for item in my_list[1:] if similar(first_item, item) <= 95]

print(my_list)
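If the list can contain several unrelated groups of near-duplicates, comparing only against the first item is not enough. One possible generalization is to compare each string against everything kept so far. The sketch below is an illustration under assumptions, not fuzzywuzzy's own algorithm: similar is reimplemented with difflib.SequenceMatcher as a stand-in for fuzz.ratio, drop_similar and its threshold of 90 are hypothetical, and the python.org URL is added only to show a second group.

```python
from difflib import SequenceMatcher

def similar(a, b):
    """Stand-in scorer (assumption): 0-100 similarity, used instead of fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def drop_similar(strings, threshold=90):
    """Keep a string only if it scores below `threshold` against every kept string."""
    kept = []
    for s in strings:
        if all(similar(s, k) < threshold for k in kept):
            kept.append(s)   # first member of a new group
    return kept

urls = ['https://www.google.com/', 'http://www.google.com/',
        'https://www.google.com', 'http://www.google.com',
        'https://www.python.org/']

print(drop_similar(urls))
# ['https://www.google.com/', 'https://www.python.org/']
```

Because a new list is built instead of mutating the one being iterated, the skipped-item problem does not arise. The cost is quadratic in the number of kept strings, which is usually fine for short lists.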


 