Nested for-loop element-wise list comparison
As a new approach to the challenge described here, I've put together the following:
from difflib import SequenceMatcher

def similar(a, b):
    # Similarity ratio in [0, 1]; 1.0 means the strings are identical
    return SequenceMatcher(None, a, b).ratio()

diffs = [
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print('"{}" and "{}" refer to the same sentence'.format(s, j))
            print()
            diffs.remove(j)
        else:
            print('"{}" is a new sentence'.format(s))
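The idea is to iterate over the strings and compare each one against the others. If a given string is judged similar to another, the other one is removed; otherwise the given string is treated as unique in the list.
Here is the output: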
"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence
"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence
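So it correctly detects that the first two sentences are similar and that the last one is unique. The problem is that it then goes back and reports the first sentence as unique (it isn't, and the loop shouldn't be returning to that sentence anyway).
Where is the flaw in my loop logic? Is there a way to accomplish this without nested for loops and without removing elements?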
from difflib import SequenceMatcher
from collections import defaultdict

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs = [
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]

sims = set()                  # indices already matched to an earlier string
simdict = defaultdict(list)   # index -> indices of later strings similar to it

for i in range(len(diffs)):
    if i in sims:
        continue              # already grouped with an earlier string
    s = diffs[i]
    for j in range(i + 1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)

for k, v in simdict.items():
    print(diffs[k] + " is similar to:")
    print('\n'.join(diffs[e] for e in v))
You can see exactly when the first sentence gets reported as unique by changing
    print('"{}" is a new sentence'.format(s))
to
    print('"{}" and "{}" are different sentences'.format(s, j))
This should help you see exactly where the loop goes wrong.
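To make that concrete, here is a minimal sketch (Python 3) that replays the question's loop on a copy of the list and records every comparison instead of printing it. The function name `trace` and the shortened sample strings are my own stand-ins, not taken from the question.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def trace(diffs, threshold=0.7):
    """Replay the question's loop and record each (s, j, is_similar) comparison."""
    diffs = list(diffs)  # work on a copy, because the loop below mutates the list
    events = []
    for s in diffs:
        others = [i for i in diffs if i != s]
        for j in others:
            same = similar(s, j) > threshold
            events.append((s, j, same))
            if same:
                diffs.remove(j)  # same mutation-while-iterating as the original
    return events

# Hypothetical, shortened stand-ins for the three strings in the question
old = "- The offset ends at age 65."
new = "+ The offset ends at age 68."
added = "+ Here's a new paragraph I added for testing."

events = trace([old, new, added])
for s, j, same in events:
    print(repr(s), "vs", repr(j), "->", "similar" if same else "different")
```

The second recorded comparison pits the first string against the unrelated third one, after its real match has already been found. Because the original `else` runs for every non-matching pair, that single comparison is enough to print the spurious "is a new sentence" line.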
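Since the two versions of a modified string always appear back to back (one prefixed with "-", the next with "+"), the following can be done (I believe it works in all cases).
When the list has an odd number of elements, the last one must be a new sentence.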
def extract_modified_and_new(diffs):
    # Compare neighbours pairwise: (0, 1), (2, 3), ...
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print(z1, 'is similar to', z2)
            print()
        else:
            print(z1, 'and', z2, 'are new')
            print()
    # With an odd number of elements, the last one has no partner
    if len(diffs) % 2 != 0:
        print(diffs[-1], 'is new')
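For reuse or testing, the same pairing idea could return its results rather than print them. This is only a sketch in Python 3 under the adjacency assumption stated above; `pair_up` and the short sample strings are hypothetical names of mine.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def pair_up(diffs, threshold=0.7):
    """Return (modified, new): similar neighbour pairs and unpaired strings."""
    modified, new = [], []
    # Compare elements at positions (0, 1), (2, 3), ...
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > threshold:
            modified.append((z1, z2))
        else:
            new.extend([z1, z2])
    # An odd-length list leaves the last element without a partner
    if len(diffs) % 2 != 0:
        new.append(diffs[-1])
    return modified, new

# Hypothetical, shortened stand-ins for the strings in the question
old = "- The offset ends at age 65."
changed = "+ The offset ends at age 68."
added = "+ Here's a new paragraph I added for testing."

modified, new = pair_up([old, changed, added])
shifted_modified, shifted_new = pair_up([added, old, changed])
```

Note that the pairing depends on position. In the second call an unpaired new line comes first, which shifts every later pair by one, so the modified pair is no longer detected. The approach therefore only holds when each modified pair starts at an even index.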