
How to optimize the combination of 2 lists of tuples and remove their duplicates?

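Following on from "How to remove elements from a list of tuples if the 2nd item in each tuple is a duplicate?", I am able to remove duplicates on the second element of the tuples within a single list of tuples.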

Say I have 2 lists of tuples:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

I need to combine the scores when the second element is the same (score_from_alist * score_from_blist), to get this desired output:

clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.105, 'this is not really a foo bar')] # 0.105 = 0.325 * 0.323

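Currently I build clist as shown below, but it takes more than 5 seconds when alist and blist each contain around 5500 or more tuples, with 20 to 40 words in the second element of each tuple. Is there any way to make the following function faster?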

import time

def overlapMatches(alist, blist):
    start_time = time.time()
    clist = []
    overlap = set()
    # Compare every tuple in alist with every tuple in blist (quadratic).
    for d in alist:
        for dn in blist:
            if d[1] == dn[1]:
                score = d[0]*dn[0]
                overlap.add((score,d[1]))
    # Keep the 20 highest-scoring matches.
    for s in sorted(overlap, reverse=True)[:20]:
        clist.append((s[0],s[1]))
    print("overlapping matches takes", time.time() - start_time)
    return clist

I would use a dictionary/set to eliminate the duplicates and provide fast lookup:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

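# sentence -> score; iterating in reverse keeps the first occurrence in blist
bdict = {k: v for v, k in reversed(blist)}
clist = []
cset = set()  # sentences already added to clist
for v, k in alist:
    if k not in cset:
        b = bdict.get(k, None)
        if b is not None:
            clist.append((v * b, k))
            cset.add(k)
print(clist)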

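Here, bdict weeds out all but the first occurrence of each sentence in blist and provides fast lookup by sentence, while cset ensures that only the first occurrence of each sentence in alist is used.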
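The reversed() call is what makes the first occurrence win: a dict comprehension keeps the last value written for each key, so walking the list backwards writes the first occurrence last. A minimal sketch of that behaviour, using made-up sample data (the `pairs` list is hypothetical, not from the question):

```python
# Hypothetical sample data, not taken from the question.
pairs = [(0.9, 'x'), (0.5, 'y'), (0.1, 'x')]

# Without reversed(): later duplicates overwrite earlier ones.
last_wins = {k: v for v, k in pairs}

# With reversed(): the first occurrence is written last, so it survives.
first_wins = {k: v for v, k in reversed(pairs)}

print(last_wins)   # {'x': 0.1, 'y': 0.5}
print(first_wins)  # {'x': 0.9, 'y': 0.5}
```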

If you don't care about the order of clist, you can simplify the structure somewhat:

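bdict = {k: v for v, k in reversed(blist)}
cdict = {}  # sentence -> combined score
for v, k in alist:
    if k not in cdict:
        b = bdict.get(k, None)
        if b is not None:
            cdict[k] = v * b
# cdict maps sentence -> score, so swap back to (score, sentence) tuples
print([(v, k) for k, v in cdict.items()])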
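The question's function also returns only the 20 highest-scoring matches. One way to fold that into the dictionary approach is sketched below. The function name `overlap_matches` and the `top_n` parameter are made up for this illustration, and the sketch assumes the desired behaviour is "first occurrence of each sentence wins", as in the expected clist, rather than the all-pairs products the original nested loop collects.

```python
import heapq

def overlap_matches(alist, blist, top_n=20):
    """Hypothetical helper: multiply the scores of sentences found in both lists."""
    # setdefault only stores a score the first time a sentence is seen,
    # so the first occurrence in each list wins.
    adict = {}
    for score, sentence in alist:
        adict.setdefault(sentence, score)
    bdict = {}
    for score, sentence in blist:
        bdict.setdefault(sentence, score)
    # Keep only sentences present in both lists.
    merged = [(adict[s] * bdict[s], s) for s in adict if s in bdict]
    # nlargest returns the top_n tuples in descending order.
    return heapq.nlargest(top_n, merged)

alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]
blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.453234, 'this is a foo bar sentence'),
         (0.323445, 'this is not really a foo bar')]

result = overlap_matches(alist, blist)
print(result)
```

Each list is scanned once and every lookup is a hash lookup, so the work grows roughly with len(alist) + len(blist) instead of their product.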

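If the tuples in each list are sorted by their first item in descending order, and for duplicates within a single list (tuples whose second items are the same) you want to keep the tuple with the highest first item: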

# remove duplicates (take the 1st item among duplicates)
a, b = [{sentence: score for score, sentence in reversed(lst)}
        for lst in [alist, blist]]

# merge (leave only tuples that have common 2nd items (sentences))
# Python 3: dict.keys() supports set operations; on Python 2.7 use viewkeys()
clist = [(a[s]*b[s], s) for s in a.keys() & b.keys()]
clist.sort(reverse=True) # sort by (score, sentence) in descending order
print(clist)

Output:

[(0.510496368389, 'this is a foo bar sentence'),
 (0.10523121352499999, 'this is not really a foo bar')]
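To get a feel for the speed difference, the nested-loop approach from the question can be timed against the dictionary approach on synthetic data. This is only a rough sketch: the function names `brute_force` and `dict_based`, the generated sentences, and the list sizes are all made up for illustration, and the timings will vary by machine. The input contains no duplicate sentences, so both functions should return the same result.

```python
import random
import time

def brute_force(alist, blist):
    # Nested loops as in the question: len(alist) * len(blist) comparisons.
    overlap = set()
    for score_a, sent_a in alist:
        for score_b, sent_b in blist:
            if sent_a == sent_b:
                overlap.add((score_a * score_b, sent_a))
    return sorted(overlap, reverse=True)

def dict_based(alist, blist):
    # One pass per list plus a key intersection.
    a, b = [{sentence: score for score, sentence in reversed(lst)}
            for lst in (alist, blist)]
    return sorted(((a[s] * b[s], s) for s in a.keys() & b.keys()), reverse=True)

random.seed(0)
# Synthetic, duplicate-free input (hypothetical data).
sentences = ['sentence number %d' % i for i in range(1000)]
alist = [(random.random(), s) for s in random.sample(sentences, 800)]
blist = [(random.random(), s) for s in random.sample(sentences, 800)]

for func in (brute_force, dict_based):
    start = time.perf_counter()
    result = func(alist, blist)
    elapsed = time.perf_counter() - start
    print(func.__name__, len(result), 'matches in', elapsed, 'seconds')
```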
