
Remove similar items from a list

I have a list of words (about 7 items) and I want to remove items that are nearly identical to other words (i.e., if my word is 'Agency Account Bank Agreement', I want to remove words like 'Agency Account Bank Agreement Pursuant').

To estimate whether a word is close to another word, I used the Jaro distance from the jellyfish package in Python.

And my current code is:

import itertools
import jellyfish

corpus3 = ['Agency Account Bank Agreement', 'Agent', 'Agency Account Bank Agreement Pursuant',
       'Agency Account Bank Agreement Notwithstanding', 'Agents', 'Agent', 'Reinvestment Period']
threshold = 0.85
new_corpus = []
for a, b in itertools.combinations(corpus3, 2):
    if len(a.split()) >= 2 or len(b.split()) >= 2:
        jf = jellyfish.jaro_distance(a, b)
        if jf > threshold:
            if a in new_corpus and b in new_corpus:
                continue
            else:
                if len(a.strip()) < len(b.strip()):
                    kw = a
                    if not new_corpus:
                        new_corpus.append(a)
                    else:
                        for item in new_corpus:
                            jf = jellyfish.jaro_distance(kw, item)
                            if jf < threshold:
                                new_corpus.append(kw)

And this is what I want at the end:

new_corpus = ['Agency Account Bank Agreement', 'Agent', 'Reinvestment Period']

Let's say you have this list:

numchars = ['one', 'ones', 'two', 'twos', 'three', 'threes']

Let's say you believe that one is too similar to ones for your taste, and you only want to keep one of the two, such that your revised list would be similar to this:

numchars = ['ones', 'twos', 'threes']

You could do this to eliminate the ones you deem too similar:

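import jellyfish
lower_threshold = 0.85  # example value; tune it for your data
# Iterate over a copy so that removing items does not skip elements.
for x in list(numchars):
    if any(lower_threshold < jellyfish.jaro_distance(x, _x) and x != _x for _x in numchars):
        numchars.remove(x)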

Depending on the thresholds you set, as well as the order of your list, this could produce results like this:

numchars = ['ones', 'twos', 'threes']

The main logic in this routine is in this line:

if any(lower_threshold < jellyfish.jaro_distance(x, _x) and x != _x for _x in numchars):

This says that if any member of numchars, when compared with every other member of the list, has a similarity rating greater than your defined lower_threshold, that member should be removed from the list via numchars.remove(x). The and x != _x condition keeps a member from registering as too similar to itself.

But the meat of this sandwich, so to speak, is in this line:

numchars.remove(x)

This statement ensures that once one has been removed for being too similar to ones, it is no longer a member of the list on later iterations, so ones is never compared against it and removed as well. Without the in-place removal, each word in a pair would flag the other, and you would end up with an empty list.
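To make the difference concrete, here is a minimal sketch that contrasts the two behaviours. It is not the code from the question or the answer: jaro_similarity is a hand-written, plain-Python version of the Jaro measure so that the example runs without the jellyfish package, and drop_near_duplicates is a hypothetical helper name. Instead of removing items from the list while looping over it, the helper builds a new list and keeps a word only if it is not too close to one that was already kept.

```python
def jaro_similarity(s1, s2):
    """Plain-Python Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are this close in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def drop_near_duplicates(items, threshold=0.85):
    """Greedy pass: keep an item only if it is not too close to a kept one."""
    kept = []
    for candidate in items:
        if all(jaro_similarity(candidate, k) <= threshold for k in kept):
            kept.append(candidate)
    return kept


numchars = ['one', 'ones', 'two', 'twos', 'three', 'threes']

# Filtering against the unchanged list flags both members of every pair.
naive = [x for x in numchars
         if not any(jaro_similarity(x, y) > 0.85 and x != y for y in numchars)]
print(naive)  # []

print(drop_near_duplicates(numchars))  # ['one', 'two', 'three']

corpus3 = ['Agency Account Bank Agreement', 'Agent',
           'Agency Account Bank Agreement Pursuant',
           'Agency Account Bank Agreement Notwithstanding',
           'Agents', 'Agent', 'Reinvestment Period']
print(drop_near_duplicates(corpus3))
# ['Agency Account Bank Agreement', 'Agent', 'Reinvestment Period']
```

Because the first occurrence always wins, the result depends on the order of the input list, which is the same order sensitivity mentioned above.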

Once you start wanting to only keep pluralizations, or other certain forms of similar match-groups, you open a whole other can of worms.
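One possible way to approach that, sketched under assumptions: decide which form should win by sorting the candidates with a preference key before a keep-first pass. The names dedupe_with_preference and prefer are hypothetical, and difflib.SequenceMatcher from the standard library stands in for the Jaro measure here (its scores differ, so the 0.8 threshold is only an example value); any function returning a score between 0 and 1 could be passed as similarity instead.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Standard-library stand-in for a similarity score in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def dedupe_with_preference(items, prefer, similarity=ratio, threshold=0.8):
    """Keep one representative per group of similar items.

    `prefer` is a sort key: items with smaller keys are considered first,
    so they win against similar items that come later.
    """
    kept = []
    for candidate in sorted(items, key=prefer):
        if all(similarity(candidate, k) <= threshold for k in kept):
            kept.append(candidate)
    # Report the survivors in their original order.
    return sorted(kept, key=items.index)


numchars = ['one', 'ones', 'two', 'twos', 'three', 'threes']

# Prefer the longer form of each pair (here, the plural).
print(dedupe_with_preference(numchars, prefer=lambda w: -len(w)))
# ['ones', 'twos', 'threes']

# Prefer the shorter form instead.
print(dedupe_with_preference(numchars, prefer=len))
# ['one', 'two', 'three']
```

Length is only a crude proxy for "plural"; a real rule for picking the preferred form would need to be tailored to the data.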
