How to search a string for a long list of patterns
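I'm writing a tool to index a document. I have a long list of hundreds, perhaps thousands, of fixed patterns. For example, my index might look like {"cat training":"p.27", "cat handling":"p.29", "cat":"p.15", "dog training":"p.62", "dog":"p.60"} and so on.
Now I want to search my text (for the sake of argument, each paragraph is a single string) for all instances of any substring in my index. (During the search, I'll sort my keys by length, as shown, so that "cat training" is matched before "cat".)
One more complication is that I want matches to occur on word boundaries. That is, I don't want "catch" to match "cat".
Is there a pythonic way to do this? My current solution is to scan through the source string word by word and try to match the start of the remaining string against my entire index. It works, but it's slow.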
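To make the two requirements concrete (longest key first, matches only on word boundaries), here is a minimal baseline sketch using Python's standard `re` module. It is not the asker's word-by-word scan, and the helper names `build_pattern` and `find_entries` are hypothetical. A single large alternation like this may not scale well to thousands of keys, which is the case the answer below addresses.

```python
import re

# Sample index taken from the question.
index = {
    "cat training": "p.27",
    "cat handling": "p.29",
    "cat": "p.15",
    "dog training": "p.62",
    "dog": "p.60",
}

def build_pattern(keys):
    """Compile one alternation, longest key first, anchored on word boundaries."""
    ordered = sorted(keys, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

def find_entries(pattern, index, text):
    """Return (start, key, value) for each non-overlapping match in text."""
    return [(m.start(), m.group(0), index[m.group(0)])
            for m in pattern.finditer(text)]

pattern = build_pattern(index)
text = "Try to catch the cat before cat training; the dog waits."
hits = find_entries(pattern, index, text)
for start, key, value in hits:
    print(start, key, value)
```

Here "catch" is skipped because the trailing `\b` fails between "t" and "c", and "cat training" wins over "cat" because longer keys come first in the alternation.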
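The Aho-Corasick algorithm was developed to solve exactly this kind of problem.
It was used to answer a previous Stack Overflow question about matching a large number of patterns.
A Python library for Aho-Corasick.
A procedure for modifying the Aho-Corasick algorithm for word boundaries.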
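One simple way to get word-boundary behaviour, without modifying the automaton itself, is to post-filter the raw matches. The sketch below is not the linked procedure. It assumes matches arrive as (index, length, needle, value) tuples, as in the implementation in this answer, and it assumes a "word character" means a letter, digit, or underscore. The function names are hypothetical.

```python
def is_word_char(ch):
    # Assumption: a word character is a letter, digit, or underscore.
    return ch.isalnum() or ch == "_"

def on_word_boundaries(haystack, index, length):
    """True if haystack[index:index+length] is not glued to a neighbouring word character."""
    end = index + length
    before_ok = index == 0 or not is_word_char(haystack[index - 1])
    after_ok = end == len(haystack) or not is_word_char(haystack[end])
    return before_ok and after_ok

def filter_word_matches(haystack, matches):
    """Keep only the (index, length, needle, value) tuples that sit on word boundaries."""
    return [m for m in matches if on_word_boundaries(haystack, m[0], m[1])]

haystack = "catch the cat"
# Hand-written raw matches, in the tuple format described above.
raw = [(0, 3, "cat", "p.15"), (10, 3, "cat", "p.15")]
kept = filter_word_matches(haystack, raw)
print(kept)
```

The match inside "catch" is dropped because the character right after it is a letter. The check only makes sense for needles that themselves begin and end with word characters.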
To give back to the community, here is my implementation of Aho-Corasick in Python. I release this into the public domain.
class AhoCorasick(object):
    """Aho-Corasick algorithm. Searches a string for any of
    a number of substrings.

    Usage: Create a list or other iterator of (needle, value) pairs.
        aho_tree = AhoCorasick(needlevaluelist)
        results = aho_tree.findAll(haystack)
        for result in results:
            # Each result is a tuple: (index, length, needle, value)

    values can be literally anything.

    Author: Edward Falk
    """
    def __init__(self, patternlist=None):
        self.root = None
        if patternlist:
            self.buildStateMachine(patternlist)

    def buildStateMachine(self, patternlist):
        root = self.__buildTree(patternlist)
        # Breadth-first pass over the trie to compute failure links.
        queue = []
        for node in root.goto.values():
            queue.append(node)
            node.fail = root
        while queue:
            rnode = queue.pop(0)
            for key, unode in rnode.goto.items():
                queue.append(unode)
                fnode = rnode.fail
                while fnode is not None and key not in fnode.goto:
                    fnode = fnode.fail
                unode.fail = fnode.goto[key] if fnode else root
                # Inherit the outputs of the failure state (shorter suffix matches).
                unode.output += unode.fail.output
        return root

    def findAll(self, string, start=0):
        '''Search this string for items in the dictionary. Yield
        (index, len, key, value) tuples.'''
        node = self.root
        # Enumerate from `start` so that reported indices refer to `string`,
        # not to the slice.
        for i, ch in enumerate(string[start:], start):
            while node is not None and ch not in node.goto:
                node = node.fail
            if node is None:
                node = self.root
                continue
            node = node.goto[ch]
            for word, value in node.output:
                l = len(word)
                yield (i - l + 1, l, word, value)

    def __buildTree(self, patternlist):
        """patternlist is a list (or any iterator) of (string,value) pairs."""
        root = AhoCorasick.Node()
        for word, value in patternlist:
            node = root
            for ch in word:
                if ch not in node.goto:
                    node.goto[ch] = AhoCorasick.Node()
                node = node.goto[ch]
            node.output.append((word, value))
        self.root = root
        return root

    class Node(object):
        '''Aho-Corasick algorithm. Each node represents a state in the
        state machine.'''
        def __init__(self):
            self.goto = {}      # Map input to next state
            self.fail = None    # Map state to next state when character doesn't match
            self.output = []    # Map state to all index entries for that state

        def __repr__(self):
            return '<Node: %d goto, %d output>' % \
                (len(self.goto), len(self.output))

        def dump(self, name, indent):
            print("%s%s: AhoCorasickNode: %d goto, output %s, fail=%s" %
                  (" " * indent, name, len(self.goto), self.output, self.fail))
            for k, v in self.goto.items():
                v.dump(k, indent + 1)
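Because the automaton reports every needle that ends at each position, a haystack containing "cat training" yields both "cat" and "cat training". If only the longest entry is wanted, as the question asks, one option is to reduce the results afterwards. The following is a hedged sketch, not part of the implementation above: `leftmost_longest` is a hypothetical helper that takes the (index, length, needle, value) tuples and keeps non-overlapping matches, preferring earlier starts and then longer needles.

```python
def leftmost_longest(matches):
    """Pick non-overlapping matches: earliest start first, longest needle on ties."""
    chosen = []
    next_free = 0  # first haystack position not covered by a chosen match
    for match in sorted(matches, key=lambda m: (m[0], -m[1])):
        index, length = match[0], match[1]
        if index >= next_free:
            chosen.append(match)
            next_free = index + length
    return chosen

# Hand-written matches for the haystack "the cat training and dog",
# in the tuple format yielded by findAll.
raw = [
    (4, 3, "cat", "p.15"),
    (4, 12, "cat training", "p.27"),
    (21, 3, "dog", "p.60"),
]
picked = leftmost_longest(raw)
print(picked)
```

Sorting costs O(n log n) in the number of raw matches, which is usually small compared with the scan itself.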