
Nested for-loop element-wise list comparison

As a novel approach to solving my challenge described here, I have put together the following:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)

The idea is to loop over the strings and compare each with the others. If a given string is deemed similar to another, the other is removed; otherwise, the given string is deemed to be unique in the list.

Here's the output:

"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence


"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence

So it's correctly detecting that the first two sentences are similar, and that the last is unique. The problem is it's then going back and deeming the first sentence to be unique (which it isn't, and it should not be returning to this sentence anyway).

Where's the flaw in my looping logic? Can this be achieved without nested for loops and removal of elements?

from difflib import SequenceMatcher
from collections import defaultdict

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs =[
"""- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
"""+ Here's a new paragraph I added for testing."""]


sims = set()
simdict = defaultdict(list)
for i in range(len(diffs)):
    if i in sims:
        continue
    s = diffs[i]

    for j in range(i+1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)


for k, v in simdict.iteritems():
    print diffs[k] + " is similar to:"
    print '\n'.join(diffs[e] for e in v)
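For readers on Python 3, here is a minimal sketch of the same index-based idea wrapped in a function. The name group_similar and the threshold parameter are assumptions made for illustration, and the sample strings are shortened stand-ins for the ones above. Unlike the snippet above, this version also records strings that have no similar partner (with an empty list) and skips candidates that an earlier string has already claimed.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def group_similar(diffs, threshold=0.7):
    """Map each representative index to the indices of later strings similar to it."""
    matched = set()  # indices already claimed by an earlier string
    groups = {}
    for i, s in enumerate(diffs):
        if i in matched:
            continue
        groups[i] = []
        # Only look forward, so no pair is compared twice and nothing is removed.
        for j in range(i + 1, len(diffs)):
            if j not in matched and similar(s, diffs[j]) > threshold:
                matched.add(j)
                groups[i].append(j)
    return groups


# Shortened stand-ins for the three strings in the question.
diffs = [
    "- offset ends for disability beneficiaries from age 65 to full retirement age (FRA).",
    "+ offset ends for disability beneficiaries from age 68 to full retirement age (FRA).",
    "+ Here's a new paragraph I added for testing.",
]

groups = group_similar(diffs)
for i, others in groups.items():
    if others:
        print(diffs[i], "is similar to:", [diffs[j] for j in others])
    else:
        print(diffs[i], "is a new sentence")
```

An empty list in the result marks a string that matched nothing after it, which is the "new sentence" case from the question.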

You can see exactly when it determines the first sentence is unique by changing

print '"{}" is a new sentence'.format(s)

to

print '"{}" and "{}" are different sentences'.format(s,j)

This should help you to see where exactly your loop fails.
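To make that concrete, here is a Python 3 sketch that replays the loop from the question but records each pairwise verdict instead of printing it. The name trace_pairs and the threshold parameter are hypothetical, and the strings are shortened stand-ins for the originals.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def trace_pairs(diffs, threshold=0.7):
    """Replay the question's loop, recording each pairwise verdict."""
    items = list(diffs)  # copy, so the caller's list is left untouched
    log = []
    for s in items:  # deliberately mutated below, exactly as in the question
        others = [i for i in items if i != s]
        for j in others:
            if similar(s, j) > threshold:
                log.append((s, j, "similar"))
                items.remove(j)
            else:
                log.append((s, j, "different"))
    return log


# Shortened stand-ins for the three strings in the question.
A = "- offset ends for disability beneficiaries from age 65 to full retirement age (FRA)."
B = "+ offset ends for disability beneficiaries from age 68 to full retirement age (FRA)."
C = "+ Here's a new paragraph I added for testing."

log = trace_pairs([A, B, C])
for s, j, verdict in log:
    print("{!r} vs {!r}: {}".format(s[:12], j[:12], verdict))
```

The second entry in the log is the first sentence compared with the third. That comparison is what produces the unexpected "is a new sentence" line: the else branch runs once per dissimilar pair, not once per string.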

Since modified strings will always appear back-to-back (one preceded by '-', the other by '+'), the following can be done (and I believe it will work in all cases).

When the list has an odd number of elements, the last must necessarily be a new sentence.

def extract_modified_and_new(diffs):
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print z1, 'is similar to', z2
            print
        else:
            print z1, ' and ', z2, 'are new'
            print
    if len(diffs) % 2 != 0:
        print diffs[-1], ' is new'
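A Python 3 variant of the same pairing idea, sketched so that it returns its results instead of printing them. The threshold parameter and the (modified, new) return shape are assumptions for illustration. It relies on the same premise as above, that each modified sentence appears as an adjacent '-' / '+' pair starting at an even index.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def extract_modified_and_new(diffs, threshold=0.7):
    """Return (modified, new): similar adjacent pairs, and strings treated as new."""
    modified, new = [], []
    # Walk the list two items at a time: (0, 1), (2, 3), ...
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > threshold:
            modified.append((z1, z2))
        else:
            new.extend([z1, z2])
    # With an odd number of items, the last one has no partner.
    if len(diffs) % 2 != 0:
        new.append(diffs[-1])
    return modified, new


# Shortened stand-ins for the three strings in the question.
A = "- offset ends for disability beneficiaries from age 65 to full retirement age (FRA)."
B = "+ offset ends for disability beneficiaries from age 68 to full retirement age (FRA)."
C = "+ Here's a new paragraph I added for testing."

modified, new = extract_modified_and_new([A, B, C])
print("modified pairs:", modified)
print("new sentences:", new)
```

Returning the two lists keeps the comparison logic separate from the reporting, so the caller can format the output however it likes.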
