Efficient way to compare two lists of different lengths - Python
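I am trying to compare list_A (60 elements) against list_B (~300,000 elements) and return, as a list, a count of how many elements of list_A occur in each element of list_B.
My lists look like this: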
list_A = ['CAT - cats are great', 'DOG - dogs are great too']
list_B = ['CAT - cats are great(A)DOG - dogs are great too(B)', 'DOG - dogs are great too(B)']
I want my count to return: [2, 1]
My implementation works, but it contains a nested for loop, which makes the run time long.
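def count_occurrences(list_A, list_B):  # wrapper assumed, since the snippet ends in `return`
    counts = []  # renamed from `list` to avoid shadowing the built-in
    for i in range(len(list_B)):
        count = 0
        for j in range(len(list_A)):
            if list_A[j] in list_B[i]:
                count += 1
        counts.append(count)
    return counts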
Any help would be greatly appreciated! Thanks :)
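Since you are looking for substrings, I don't think there is any way to optimize it. You can, however, simplify the code with a list comprehension and sum().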
result = [sum(phrase in sentence for phrase in list_A) for sentence in list_B]
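As a quick check against the sample data from the question, here is a minimal sketch that wraps the comprehension in a function. The name count_matches is only for illustration and is not part of the original answer.

```python
def count_matches(phrases, sentences):
    """For each sentence, count how many phrases occur in it as substrings."""
    # `phrase in sentence` is a bool, and bool is a subclass of int,
    # so sum() adds 1 for every phrase that is found.
    return [sum(phrase in sentence for phrase in phrases)
            for sentence in sentences]


list_A = ['CAT - cats are great', 'DOG - dogs are great too']
list_B = ['CAT - cats are great(A)DOG - dogs are great too(B)',
          'DOG - dogs are great too(B)']

print(count_matches(list_A, list_B))  # [2, 1]
```

This still does len(list_A) substring searches per sentence, so it is tidier than the nested loop rather than asymptotically faster.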
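If you know list_A in advance, or only need to run this once, @Barmar's answer is fast and correct. If that is not the case, you could consider the following approach (it should also be fast, but it takes more steps).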
import collections

def count(target, summaries):
    # Sum the precomputed counts of every target item in each summary
    return [sum(s[t] for t in target) for s in summaries]

mines = ['aa', 'ab', 'abc', 'aabc']
# Summarize each string once; a Counter over a string counts single characters
summaries = [collections.Counter(m) for m in mines]
gold = ['a', 'b']
silver = ['c']
assert count(gold, summaries) == [2, 2, 2, 3]
assert count(silver, summaries) == [0, 0, 1, 1]
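It is also worth noting that if you are looking at 60 against 300,000, there may be speed-ups and simplifications that this toy example misses, for example if the 60 are the numbers 1-60, or alphanumerics, and so on. It may also be that the number of values that do not match is so small that it is easier to count those and subtract them from the length.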
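I have actually answered an almost identical question before, which can be found here: https://stackoverflow.com/a/55914487/2284490 The only difference is that you want to know len(matches) rather than any(matches) in the algorithm.
This can be solved efficiently with a variation of the Aho-Corasick algorithm.
It is an efficient dictionary-matching algorithm that locates all patterns in the text simultaneously in O(p + q + r), where p = length of the patterns, q = length of the text, and r = length of the returned matches.
You may want to run two separate state machines at the same time, and you would need to modify them so that they terminate at the first match.
I took a stab at the modifications, starting from this Python implementation: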
class AhoNode(object):
    def __init__(self):
        self.goto = {}     # outgoing transitions: symbol -> AhoNode
        self.count = 0     # number of patterns that end at this state
        self.fail = None   # failure link, followed when no transition matches

def aho_create_forest(patterns):
    # Build a trie containing every pattern
    root = AhoNode()
    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.count += 1
    return root

def aho_create_statemachine(patterns):
    root = aho_create_forest(patterns)
    queue = []
    # .values() / .items() work on both Python 2 and Python 3
    for node in root.goto.values():
        queue.append(node)
        node.fail = root
    # Breadth-first traversal to set the failure links
    while queue:
        rnode = queue.pop(0)
        for key, unode in rnode.goto.items():
            queue.append(unode)
            fnode = rnode.fail
            while fnode is not None and key not in fnode.goto:
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            # Inherit the matches that end at the failure state
            unode.count += unode.fail.count
    return root

def aho_count_all(s, root):
    total = 0
    node = root
    for c in s:
        # Follow failure links until a transition on c exists
        while node is not None and c not in node.goto:
            node = node.fail
        if node is None:
            node = root
            continue
        node = node.goto[c]
        total += node.count
    return total

def pattern_counter(patterns):
    ''' Returns an efficient counter function that takes a string
    and returns the number of patterns matched within it
    '''
    machine = aho_create_statemachine(patterns)
    def counter(text):
        return aho_count_all(text, machine)
    return counter
And to use it:
patterns = ['CAT - cats are great', 'DOG - dogs are great too']
counter = pattern_counter(patterns)
text_list = ['CAT - cats are great(A)DOG - dogs are great too(B)',
             'DOG - dogs are great too(B)']
for text in text_list:
    # Parenthesized single argument prints the same on Python 2 and 3
    print('%r - %s' % (text, counter(text)))
which displays:
'CAT - cats are great(A)DOG - dogs are great too(B)' - 2
'DOG - dogs are great too(B)' - 1
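Note that this solution counts each match separately, so looking for "a" and "b" in "aba" would give 3. If you only need one match per pattern, you have to keep track of all the patterns seen, which takes a minor modification to convert the integer into a set: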
- self.count = 0
+ self.seen = set()
...
- node.count += 1
+ node.seen.add(path)
...
- unode.count += unode.fail.count
+ unode.seen |= unode.fail.seen
...
- total = 0
+ all_seen = set()
- total += node.count
+ all_seen |= node.seen
- return total
+ return len(all_seen)
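Putting the diff above together, a self-contained sketch of the set-based variant might look like the following. The function names build_machine and count_distinct are my own, not from the original implementation, and I have swapped the list-based queue for collections.deque; treat it as an illustration rather than a tuned implementation.

```python
from collections import deque


class AhoNode(object):
    def __init__(self):
        self.goto = {}      # symbol -> child node
        self.seen = set()   # patterns recognized when this state is reached
        self.fail = None    # failure link


def build_machine(patterns):
    """Build the trie, then set failure links breadth-first."""
    root = AhoNode()
    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.seen.add(path)
    queue = deque()
    for node in root.goto.values():
        node.fail = root
        queue.append(node)
    while queue:
        rnode = queue.popleft()
        for key, unode in rnode.goto.items():
            queue.append(unode)
            fnode = rnode.fail
            while fnode is not None and key not in fnode.goto:
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            # Inherit patterns that end at the failure state
            unode.seen |= unode.fail.seen
    return root


def count_distinct(text, root):
    """Number of distinct patterns that occur at least once in text."""
    all_seen = set()
    node = root
    for c in text:
        while node is not None and c not in node.goto:
            node = node.fail
        if node is None:
            node = root
            continue
        node = node.goto[c]
        all_seen |= node.seen
    return len(all_seen)


patterns = ['CAT - cats are great', 'DOG - dogs are great too']
texts = ['CAT - cats are great(A)DOG - dogs are great too(B)',
         'DOG - dogs are great too(B)']
machine = build_machine(patterns)   # built once, reused for every text
print([count_distinct(t, machine) for t in texts])  # [2, 1]
```

With this version, looking for "a" and "b" in "aba" gives 2 instead of 3. The set unions add some overhead per character, so this is a little slower than the integer-counting version.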