
Python - Problem with loops while comparing two lists

I have a small problem: I am trying to compare two lists of words to compute a similarity percentage, but if the same word appears twice in each list, I get a wrong percentage.

First I wrote this little script:

data1 = ['test', 'super', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'test', 'boom']
res = 0
nb = (len(data1) + len(data2)) / 2
if data1 and data2 and nb != 0:
    for id1, item1 in enumerate(data1):
        for id2, item2 in enumerate(data2):
            if item1 == item2:
                res += 1 - abs(id1 - id2) / nb
    print(res / nb * 100)

The problem is that if the same word appears twice in the lists, the percentage is greater than 100%. To counter that, I added a break right after the line res += 1 - abs(id1 - id2) / nb, but the percentage is still wrong.

I hope my problem is clear. Thank you for your help!
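To see where the inflated figure comes from, the script in the question can be wrapped in a function and run with and without the break. This is only a sketch: the name naive_score and the stop_at_first flag are made up for illustration and are not part of the original script.

```python
def naive_score(a, b, stop_at_first=False):
    """Re-run the question's double loop; optionally break on the first match."""
    nb = (len(a) + len(b)) / 2
    res = 0
    pairs = 0  # number of (id1, id2) pairs that contributed to res
    for id1, item1 in enumerate(a):
        for id2, item2 in enumerate(b):
            if item1 == item2:
                res += 1 - abs(id1 - id2) / nb
                pairs += 1
                if stop_at_first:
                    break
    return res / nb * 100, pairs


data = ['test', 'super', 'class', 'test', 'boom']

# Without break: each 'test' pairs with both 'test' entries in the other
# list, so 7 pairs contribute instead of 5 and the score passes 100.
print(naive_score(data, data))

# With break: the second 'test' stops at position 0 of the other list,
# three positions away, so it only earns a partial score.
print(naive_score(data, data, stop_at_first=True))
```

Without the break the duplicated word is counted too often; with the break it is matched against the first occurrence rather than the nearest one, so identical lists still do not reach 100%.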

You can use difflib.SequenceMatcher instead to compare the similarity of two lists. Try this:

from difflib import SequenceMatcher as sm
data1 = ['test', 'super', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'test', 'boom']
matching_percentage = sm(None, data1, data2).ratio() * 100
print(matching_percentage)

Output:

100.0
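
Identical lists only show the trivial case. As a rough illustration of how the ratio behaves when the lists differ, here is a small sketch; the helper name list_similarity and the second list are made up for the example. According to the difflib documentation, ratio() returns 2.0*M/T, where M is the number of matching elements and T is the total number of elements in both sequences.

```python
from difflib import SequenceMatcher


def list_similarity(a, b):
    """Return SequenceMatcher's ratio for two lists as a percentage."""
    return SequenceMatcher(None, a, b).ratio() * 100


original = ['test', 'super', 'class', 'test', 'boom']
changed = ['test', 'super', 'klass', 'test', 'boom']  # one word differs

print(list_similarity(original, original))  # identical lists
print(list_similarity(original, changed))   # 4 of 5 words still line up
```

Here 4 elements match out of 10 in total, so the second call should give about 80. Repeated words do not inflate the result, because each element can belong to at most one matching block.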
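
from collections import defaultdict

data1 = ['test', 'super', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'test', 'boom']

# Count how often each word occurs in each list.
dic1 = defaultdict(int)
dic2 = defaultdict(int)

for i in data1:
    dic1[i] += 1

for i in data2:
    dic2[i] += 1

# Sum the count differences over every word seen in either list;
# a word missing from one list has a count of 0 there.
count = 0
for i in set(dic1) | set(dic2):
    count += abs(dic2[i] - dic1[i])

result = (1 - count / (len(data1) + len(data2))) * 100
print(result)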

Output:

100.0
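
A related order-insensitive measure can be written more compactly with collections.Counter. This is a sketch under assumptions, not the answer's own code: the function name multiset_similarity is made up, a word missing from one list is treated as a difference, and two empty lists are arbitrarily scored as 100.

```python
from collections import Counter


def multiset_similarity(a, b):
    """Percentage of words shared by two lists, ignoring word order."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty lists count as identical
    # Counter & Counter keeps the minimum count of each word,
    # i.e. how many occurrences the two lists have in common.
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / total * 100


print(multiset_similarity(['test', 'super', 'test'], ['test', 'super', 'test']))
print(multiset_similarity(['a', 'a', 'b'], ['a', 'b', 'b']))
print(multiset_similarity(['a'], ['b']))
```

Because only counts are compared, a reordered list still scores 100, so this suits cases where word position does not matter.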

Try this code:

data1 = ['test', 'super', 'class', 'class', 'test', 'boom']
data2 = ['test', 'super', 'class', 'class', 'test', 'boom']
res = 0
nb = (len(data1) + len(data2)) / 2.0

def pos_iter(index, sz):
    # Yield the valid positions in range(sz), nearest to index first.
    if index < sz:
        yield index
    i1 = index - 1
    i2 = index + 1
    # Keep going while either side still has positions left.
    while i1 >= 0 or i2 < sz:
        if 0 <= i1 < sz:
            yield i1
        i1 -= 1
        if i2 < sz:
            yield i2
        i2 += 1

if data1 and data2 and nb != 0:
    for id1, item1 in enumerate(data1):
        for id2 in pos_iter(id1, len(data2)):
            item2 = data2[id2]
            if item1 == item2:
                # Score the closest match only, never below zero.
                res += max(0, 1 - abs(id1 - id2) / nb)
                break
    print(res / nb * 100)

The problem with your code is that you always search data2 for a matching word from the beginning, which gives wrong values when words repeat. You need to search outward from the word's position in data1, because you want to find the closest match.

You also need the break you added; otherwise a text made of identical words will score well above 1.0 (100%). Your nb variable needs to be a float (otherwise Python 2 truncates the result of the division). And you should make sure 1 - abs(id1 - id2) / nb never drops below zero, hence the max(0, ...) I added.
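
The "closest match first" idea can also be sketched without a generator, by sorting the candidate positions by their distance from the current index. This is an illustrative variant, not the code above: the name positional_similarity is made up, and sorting for every word is slower than walking outward, which should only matter for long lists.

```python
def positional_similarity(a, b):
    """Score each word of a against its nearest occurrence in b, as a percentage."""
    if not a or not b:
        return 0.0  # assumption: an empty list matches nothing
    nb = (len(a) + len(b)) / 2.0
    res = 0.0
    for i, word in enumerate(a):
        # Positions in b ordered by distance from i, so the nearest match wins.
        for j in sorted(range(len(b)), key=lambda k: abs(k - i)):
            if b[j] == word:
                res += max(0.0, 1 - abs(i - j) / nb)
                break  # count each word of a at most once
    return res / nb * 100


words = ['test', 'super', 'class', 'class', 'test', 'boom']
print(positional_similarity(words, words))
print(positional_similarity(['a', 'b'], ['b', 'a']))
```

One caveat that applies to this sketch: nothing stops two words in the first list from matching the same position in the second list, so lists with very different word counts can still score higher than expected.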
