How to optimize the combination of 2 lists of tuples and remove their duplicates?

From the question "How do I remove element from a list of tuple if the 2nd item in each tuple is a duplicate?", I am able to remove tuples with a duplicate 2nd element from a single list of tuples.

Let's say I have 2 lists of tuples:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

And I need to combine the scores when the 2nd elements are the same (score_from_alist * score_from_blist) to achieve the desired output:

clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.105, 'this is not really a foo bar')] # 0.105 = 0.325 * 0.323

Currently, I build clist as shown below, but it takes 5+ seconds when my alist and blist have around 5500+ tuples, where the 2nd element has around 20-40 words each. Is there any way to make the following function faster?

import time

def overlapMatches(alist, blist):
    start_time = time.time()
    clist = []
    overlap = set()
    # Compare every tuple in alist with every tuple in blist.
    for d in alist:
        for dn in blist:
            if d[1] == dn[1]:
                score = d[0]*dn[0]
                overlap.add((score,d[1]))
    # Keep the 20 highest-scoring (score, sentence) pairs.
    for s in sorted(overlap, reverse=True)[:20]:
        clist.append((s[0],s[1]))
    print("overlapping matches takes %s" % (time.time() - start_time))
    return clist

I would use dictionaries/sets to both eliminate duplicates and provide fast lookups:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

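# sentence -> score; reversed() lets the first appearance in blist win
bdict = {k:v for v,k in reversed(blist)}
clist = []
cset = set()  # sentences from alist already added to clist
for v,k in alist:
    if k not in cset:
        b = bdict.get(k, None)
        if b is not None:
            clist.append((v * b, k))
            cset.add(k)
print(clist)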

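Here, bdict keeps only the first appearance of each sentence in blist (iterating over reversed(blist) lets earlier entries overwrite later ones) and provides fast lookups by sentence, while cset makes sure only the first appearance of each sentence in alist is used.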
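For the function in the question, the same idea could be wrapped up roughly as follows. This is only a sketch: the name overlap_matches, the top_n parameter, and the use of heapq.nlargest for the top-20 cut are my own assumptions, not part of the code above. Note that it keeps only the first score per sentence from each list, whereas the original nested loop multiplies every matching pair.

```python
import heapq


def overlap_matches(alist, blist, top_n=20):
    """Hypothetical helper: combine first-seen scores of sentences in both lists."""
    # First score seen for each sentence in blist (setdefault never overwrites).
    first_b = {}
    for score, sentence in blist:
        first_b.setdefault(sentence, score)

    seen = set()  # sentences from alist that were already handled
    combined = []
    for score, sentence in alist:
        if sentence in seen:
            continue
        seen.add(sentence)
        if sentence in first_b:
            combined.append((score * first_b[sentence], sentence))

    # The top_n highest-scoring tuples, in descending order.
    return heapq.nlargest(top_n, combined)


alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]

blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.453234, 'this is a foo bar sentence'),
         (0.323445, 'this is not really a foo bar')]

result = overlap_matches(alist, blist)
print(result)
```

Each list is scanned once, so the work grows with len(alist) + len(blist) instead of len(alist) * len(blist).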

If you don't care about the ordering of clist, you can simplify the structures somewhat:

bdict = {k:v for v,k in reversed(blist)}
cdict = {}
for v,k in alist:
    if k not in cdict:
        b = bdict.get(k, None)
        if b is not None:
            cdict[k] = v * b
# cdict maps sentence -> score; emit (score, sentence) tuples
print([(score, sentence) for sentence, score in cdict.items()])
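One note on ordering: on Python 3.7 and later a plain dict preserves insertion order, so there this version also returns the results in alist order. A small sketch with made-up data (the sentences 'b' and 'a' and the function name combine_first are hypothetical, used only for illustration):

```python
def combine_first(alist, blist):
    # Same logic as the snippet above, wrapped in a function for illustration.
    bdict = {k: v for v, k in reversed(blist)}
    cdict = {}
    for v, k in alist:
        if k not in cdict:
            b = bdict.get(k)
            if b is not None:
                cdict[k] = v * b
    return [(score, sentence) for sentence, score in cdict.items()]


# Made-up data: 'b' appears before 'a' in demo_a.
demo_a = [(0.2, 'b'), (0.9, 'a'), (0.1, 'b')]
demo_b = [(0.5, 'a'), (0.5, 'b')]
demo = combine_first(demo_a, demo_b)
print([sentence for _, sentence in demo])  # ['b', 'a'] on Python 3.7+
```

On older interpreters the order of cdict.items() is arbitrary, which is the case the simplification is meant for.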

To keep the tuple with the highest 1st item when there are duplicates within a single list (assuming each list is sorted in descending order by the 1st item), and to combine the scores from the two lists when the 2nd items (sentences) are the same:

# remove duplicates (take the 1st item among duplicates)
a, b = [{sentence: score for score, sentence in reversed(lst)}
        for lst in [alist, blist]]

# merge (leave only tuples that have common 2nd items (sentences))
clist = [(a[s]*b[s], s) for s in a.viewkeys() & b.viewkeys()]
clist.sort(reverse=True) # sort by (score, sentence) in descending order
print(clist)

Output:

[(0.510496368389, 'this is a foo bar sentence'),
 (0.10523121352499999, 'this is not really a foo bar')]
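The code above targets Python 2 (dict.viewkeys()). On Python 3, dict.keys() already returns a set-like view, so a roughly equivalent sketch could look like this; the function names first_scores and merge_common are hypothetical, introduced only for illustration:

```python
def first_scores(pairs):
    """Map each sentence to the score of its first occurrence in pairs."""
    # Iterating in reverse lets earlier tuples overwrite later ones.
    return {sentence: score for score, sentence in reversed(pairs)}


def merge_common(alist, blist):
    a, b = first_scores(alist), first_scores(blist)
    # keys() views support set intersection in Python 3.
    merged = [(a[s] * b[s], s) for s in a.keys() & b.keys()]
    merged.sort(reverse=True)  # descending by (score, sentence)
    return merged


alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]

blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.453234, 'this is a foo bar sentence'),
         (0.323445, 'this is not really a foo bar')]

merged = merge_common(alist, blist)
print(merged)
```

If only the top 20 are needed, as in the question, slicing the sorted list with merged[:20] would do.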
