Nested for loops for element-by-element list comparison

Question

As a novel approach to the challenge described here, I put together the following:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)

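The idea is to iterate over the strings and compare them with one another. If a given string is deemed similar to another, the other one is removed; otherwise the given string is treated as unique in the list.

This is the output: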

"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence


"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence

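So it correctly detects that the first two sentences are similar and that the last one is unique. The problem is that it then goes back and decides the first sentence is unique (it isn't, and the loop shouldn't be returning to this sentence anyway).

Where is the flaw in my loop logic? Is there a way to accomplish this without nested for loops and removing elements?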

Answer 1

from difflib import SequenceMatcher
from collections import defaultdict

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]


sims = set()
simdict = defaultdict(list)
for i in range(len(diffs)):
    if i in sims:
        continue
    s = diffs[i]

    for j in range(i+1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)


for k, v in simdict.iteritems():
    print diffs[k] + " is similar to:"
    print '\n'.join(diffs[e] for e in v)
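The answer above is code only, so here is a minimal sketch of the same index-based idea written for Python 3, returning the groups instead of printing them. The names `group_similar`, `unique_indices`, and `sample`, and the shortened sample strings, are assumptions made for illustration and do not appear in the original answer. Because the inner loop only looks at indices after `i` and the list is never mutated, a string that has already been paired is not visited again.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def group_similar(diffs, threshold=0.7):
    """Map each index to the later indices whose strings resemble it."""
    matched = set()               # indices already claimed by an earlier string
    groups = defaultdict(list)
    for i, s in enumerate(diffs):
        if i in matched:
            continue              # skip strings that were already paired
        for j in range(i + 1, len(diffs)):
            if similar(s, diffs[j]) > threshold:
                matched.add(j)
                groups[i].append(j)
    return dict(groups)


def unique_indices(count, groups):
    """Indices that neither head a group nor belong to one."""
    paired = set(groups) | {j for js in groups.values() for j in js}
    return [i for i in range(count) if i not in paired]


# Shortened, hypothetical stand-ins for the strings in the question.
sample = [
    "- offset ends at age 65 for disability beneficiaries",
    "+ offset ends at age 68 for disability beneficiaries",
    "+ Here's a new paragraph I added for testing.",
]

groups = group_similar(sample)
print(groups)
print(unique_indices(len(sample), groups))
```

With this shape, reporting "new" sentences becomes a separate step that runs after all comparisons are done, rather than a side effect of each pairwise check.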

Answer 2

You can see exactly when the first sentence is judged to be unique by changing

print '"{}" is a new sentence'.format(s)

to

print '"{}" and "{}" are different sentences'.format(s,j)

This should help you see exactly where the loop goes wrong.
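To make that suggestion concrete, the sketch below replays the question's loop in Python 3 and records each pairwise verdict in a list instead of printing it. The function name `trace`, the log format, and the shortened sample strings are assumptions for illustration only. The log suggests what goes wrong: the `else` branch fires once per non-matching pair, so the first string is reported when it is compared with the third, even though it has just matched the second.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def trace(diffs, threshold=0.7):
    """Replay the question's loop on a copy and log every pairwise verdict."""
    diffs = list(diffs)           # work on a copy; the caller's list is untouched
    log = []
    for s in diffs:               # note: the list shrinks while this loop runs
        others = [i for i in diffs if i != s]
        for j in others:
            if similar(s, j) > threshold:
                log.append(("same", s, j))
                diffs.remove(j)
            else:
                log.append(("different", s, j))
    return log


# Shortened, hypothetical stand-ins for the strings in the question.
a = "- offset ends at age 65 for disability beneficiaries"
b = "+ offset ends at age 68 for disability beneficiaries"
c = "+ Here's a new paragraph I added for testing."

for verdict, left, right in trace([a, b, c]):
    print(verdict, "|", left, "|", right)
```

The second log entry pairs the first string with the third, which corresponds to the unexpected "is a new sentence" line in the question's output.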

Answer 3

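Since modified strings always appear back to back (one string prefixed with "-", followed by another prefixed with "+"), the following can be done (I believe it works in all cases).

When the number of elements in the list is odd, the last one must be a new sentence.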

def extract_modified_and_new(diffs):
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print z1, 'is similar to', z2
            print
        else:
            print z1, ' and ', z2, 'are new'
            print
    if len(diffs) % 2 != 0:
        print diffs[-1], ' is new'
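As a rough sketch, the same pairing can be written for Python 3 so that it returns labels instead of printing, which makes the adjacency assumption easy to check. The name `classify_pairs`, the labels "modified" and "new", and the shortened sample strings are assumptions for illustration. The approach depends on every "-" line being immediately followed by its "+" counterpart, starting at an even index; if an added line precedes a modified pair, the pairing shifts by one and the pair is no longer compared.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def classify_pairs(diffs, threshold=0.7):
    """Compare items 0-1, 2-3, ... and label each pair; a leftover item is new."""
    results = []
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        label = "modified" if similar(z1, z2) > threshold else "new"
        results.append((label, z1, z2))
    if len(diffs) % 2 != 0:
        results.append(("new", diffs[-1], None))  # odd count: last item is unpaired
    return results


# Shortened, hypothetical stand-ins for the strings in the question.
a = "- offset ends at age 65 for disability beneficiaries"
b = "+ offset ends at age 68 for disability beneficiaries"
c = "+ Here's a new paragraph I added for testing."

print(classify_pairs([a, b, c]))   # pair (a, b) lines up as expected
print(classify_pairs([c, a, b]))   # an added line first shifts the pairing
```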

3 answers

Answer 1
1 2016-02-19 21:44:44

Answer 2
0 2016-02-19 21:59:22

Answer 3
0 2016-02-20 04:22:13
