
Remove similar items from a list

I have a list of words (about 7 items) and I want to remove items that are nearly identical to other words in the list (i.e., if one of my words is 'Agency Account Bank Agreement', I want to remove words like 'Agency Account Bank Agreement Pursuant').

To estimate whether a word is close to another word, I used the Jaro distance from the jellyfish package in Python.

My current code is:

import itertools

import jellyfish

corpus3 = ['Agency Account Bank Agreement', 'Agent', 'Agency Account Bank Agreement Pursuant',
           'Agency Account Bank Agreement Notwithstanding', 'Agents', 'Agent', 'Reinvestment Period']
new_corpus = []
threshold = 0.85  # decimal point, not a comma (0,85 would be a tuple)
for a, b in itertools.combinations(corpus3, 2):
    # only compare pairs where at least one entry has two or more words
    if len(a.split()) >= 2 or len(b.split()) >= 2:
        jf = jellyfish.jaro_distance(a, b)
        if jf > threshold:
            if a in new_corpus and b in new_corpus:
                continue
            else:
                if len(a.strip()) < len(b.strip()):
                    kw = a
                    if not new_corpus:
                        new_corpus.append(a)
                    else:
                        for item in new_corpus:
                            jf = jellyfish.jaro_distance(kw, item)
                            if jf < threshold:
                                new_corpus.append(kw)

And this is what I want at the end:

new_corpus = ['Agency Account Bank Agreement', 'Agent', 'Reinvestment Period']

Let's say you have this list:

numchars = ['one', 'ones', 'two', 'twos', 'three', 'threes']

Let's say you consider 'one' too similar to 'ones' for your taste, and you only want to keep one of the two, so that your revised list looks like this:

numchars = ['ones', 'twos', 'threes']

You could do this to eliminate the ones you deem too similar:

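import jellyfish
lower_threshold = 0.85  # assumed value, borrowed from the question's threshold
for x in list(numchars):  # iterate over a copy because numchars shrinks inside the loop
    if any(lower_threshold < jellyfish.jaro_distance(x, _x) and x != _x for _x in numchars):
        numchars.remove(x)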

Depending on the threshold you set, as well as the order of your list, this could produce a result like this:

numchars = ['ones', 'twos', 'threes']

The main logic in this routine is in this line:

if any(lower_threshold < jellyfish.jaro_distance(x, _x) and x != _x for _x in numchars):

This says that if any member of the list numchars, when compared to every other member of that list, has a similarity score greater than your defined lower_threshold, that member should be removed from the list with numchars.remove(x). The and x != _x condition avoids flagging a member as being too similar to itself.

But the meat of this sandwich, so to speak, is in this line:

numchars.remove(x)

This statement ensures that once you remove 'one' for being too similar to 'ones', 'one' is no longer a member of the list on later iterations, so 'ones' is never compared against it and removed as well. Removing both members of every similar pair would leave you with an empty list.
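The difference between removing as you go and flagging everything against the untouched list can be seen in a small sketch. So that it runs without third-party packages, it swaps in difflib's SequenceMatcher.ratio() from the standard library for the Jaro score; that substitution, the 0.8 threshold, and the helper names similarity and too_similar are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for the Jaro score (assumption): difflib's ratio, in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def too_similar(x, others, lower_threshold):
    # True if x scores above the threshold against any different word.
    return any(lower_threshold < similarity(x, y) and x != y for y in others)


words = ['one', 'ones', 'two', 'twos', 'three', 'threes']
lower_threshold = 0.8  # assumed value for this stand-in metric

# Remove as you go: once 'one' is gone, 'ones' has no close neighbour left.
remove_as_you_go = list(words)
for x in list(remove_as_you_go):  # iterate over a snapshot of the list
    if too_similar(x, remove_as_you_go, lower_threshold):
        remove_as_you_go.remove(x)

# Flag against the untouched list: both halves of every pair are dropped.
flag_first = [x for x in words if not too_similar(x, words, lower_threshold)]

print(remove_as_you_go)  # ['ones', 'twos', 'threes']
print(flag_first)        # []
```

With a different metric the scores change, so the threshold would need to be tuned again before drawing conclusions from real data.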

Once you want to keep only the plural forms, or some other particular form from each group of similar matches, you open a whole other can of worms.
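For the narrower goal in the question, keeping the shortest member of each group of near-duplicates, one possible sketch is a greedy pass from shortest to longest. The jaro helper below is a plain-Python rendering of the standard Jaro similarity, included only so the example runs without jellyfish; passing the library's own function as the similarity argument should work as well (as far as I know, newer jellyfish releases expose it as jaro_similarity). The names jaro and keep_shortest are hypothetical, and the 0.85 threshold is taken from the question.

```python
def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    for i, ch in enumerate(s1):
        # Take the first unused equal character inside the match window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                break
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    if not m1:
        return 0.0
    m = len(m1)
    t = sum(a != b for a, b in zip(m1, m2)) // 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def keep_shortest(items, threshold, similarity=jaro):
    """Greedy filter: drop any item too similar to a shorter kept item."""
    kept = []
    for candidate in sorted(items, key=len):  # shortest first
        if all(similarity(candidate, k) <= threshold for k in kept):
            kept.append(candidate)
    return sorted(kept, key=items.index)  # restore the input order


corpus3 = ['Agency Account Bank Agreement', 'Agent',
           'Agency Account Bank Agreement Pursuant',
           'Agency Account Bank Agreement Notwithstanding',
           'Agents', 'Agent', 'Reinvestment Period']

new_corpus = keep_shortest(corpus3, threshold=0.85)
print(new_corpus)
# ['Agency Account Bank Agreement', 'Agent', 'Reinvestment Period']
```

Because the pass is greedy, the result depends on the threshold and on which item in a group happens to be shortest; exact duplicates such as the second 'Agent' are dropped because they score 1.0 against the copy already kept.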
