
How to search a string for a long list of patterns

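I'm writing a tool to index documents. I have a long list of hundreds, perhaps thousands, of fixed patterns. For example, my index might look like {"cat training":"p.27", "cat handling":"p.29", "cat":"p.15", "dog training":"p.62", "dog":"p.60"} and so on.

Now I want to search my text (for the sake of argument, each paragraph is a single string) for all instances of any substring in my index. (During the search, I'll sort the keys by length, as shown, so that "cat training" is matched before "cat".)

One more complication: I want matches to occur on word boundaries. That is, I don't want "catch" to match "cat".

Is there a pythonic way to do this? My current solution is to scan the source string word by word, and at each word try to match the start of the remaining string against my entire index. It works, but it's slow.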
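The matching rules described in the question (longest key first, matches only on word boundaries) can be sketched with a single compiled regular expression. This is only a baseline under stated assumptions, not the approach the answers recommend: `build_matcher` and `find_entries` are hypothetical names, every key is assumed to start and end with a word character so that `\b` behaves as intended, and a regex alternation over thousands of keys may not scale as well as a dedicated multi-pattern algorithm.

```python
import re

# Sample data copied from the index in the question.
INDEX = {
    "cat training": "p.27",
    "cat handling": "p.29",
    "cat": "p.15",
    "dog training": "p.62",
    "dog": "p.60",
}


def build_matcher(index):
    """Compile one regex that matches any key of index on word boundaries."""
    # Longest keys first, so "cat training" is tried before "cat".
    keys = sorted(index, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternation + r")\b")


def find_entries(index, paragraph):
    """Return (start, key, page) for each non-overlapping match."""
    matcher = build_matcher(index)
    return [(m.start(), m.group(0), index[m.group(0)])
            for m in matcher.finditer(paragraph)]
```

Calling `find_entries(INDEX, "catch the cat before cat training")` skips "catch" and reports "cat" and "cat training" with their page references.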

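The Aho-Corasick algorithm was developed to solve exactly this kind of problem.

It was used to answer a previous Stack Overflow question about matching a large number of patterns.

A Python library for Aho-Corasick.

A procedure for modifying the Aho-Corasick algorithm for word boundaries.

To give back to the community, here is my implementation of Aho-Corasick in Python. I release this into the public domain.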

class AhoCorasick(object):
  """Aho-Corasick algorithm. Searches a string for any of
  a number of substrings.

  Usage: Create a list or other iterator of (needle, value) pairs.
      aho_tree = AhoCorasick(needlevaluelist)
      results = aho_tree.findAll(haystack)
      for result in results:
        # Each result is a tuple: (index, length, needle, value)

  values can be literally anything.

  Author: Edward Falk
  """
  def __init__(self, patternlist=None):
    self.root = None
    if patternlist:
      self.buildStateMachine(patternlist)
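  def buildStateMachine(self, patternlist):
    """Build the trie, then fill in the failure links breadth-first."""
    root = self.__buildTree(patternlist)
    queue = []
    # Depth-1 nodes always fail back to the root.
    for node in root.goto.values():
      queue.append(node)
      node.fail = root
    while queue:
      rnode = queue.pop(0)
      for key, unode in rnode.goto.items():
        queue.append(unode)
        # Follow failure links until some state has a transition on key.
        fnode = rnode.fail
        while fnode is not None and key not in fnode.goto:
          fnode = fnode.fail
        unode.fail = fnode.goto[key] if fnode else root
        # Inherit the matches that end at the failure state.
        unode.output += unode.fail.output
    return root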
  def findAll(self, string, start=0):
    '''Search this string for items in the dictionary. Yield
    (index, len, key, value) tuples; index is an offset into string.'''
    node = self.root
    # Count from start so that index is relative to the whole string.
    for i,ch in enumerate(string[start:], start):
      while node is not None and ch not in node.goto:
        node = node.fail
      if node is None:
        # No state accepts ch: restart from the root.
        node = self.root
        continue
      node = node.goto[ch]
      for word,value in node.output:
        l = len(word)
        yield (i-l+1, l, word, value)
  def __buildTree(self, patternlist):
    """patternlist is a list (or any iterator) of (string,value) pairs."""
    root = AhoCorasick.Node()
    for word,value in patternlist:
      node = root
      for ch in word:
        if ch not in node.goto:
          node.goto[ch] = AhoCorasick.Node()
        node = node.goto[ch]
      node.output.append((word,value))
    self.root = root
    return root

  class Node(object):
    '''Aho-Corasick algorithm. Each node represents a state in the
    state machine.'''
    def __init__(self):
      self.goto = {}        # Map input to next state
      self.fail = None      # Map state to next state when character doesn't match
      self.output = []      # Map state to all index entries for that state
    def __repr__(self):
      return '<Node: %d goto, %d output>' % \
        (len(self.goto), len(self.output))
    def dump(self, name, indent):
      # Print this node, then its children one indent level deeper.
      print("%s%s: AhoCorasickNode: %d goto, output %s, fail=%s" %
        ("  "*indent, name, len(self.goto), self.output, self.fail))
      for k,v in self.goto.items():
        v.dump(k, indent+1)
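The class above reports every occurrence, including "cat" inside "catch", so the word-boundary requirement from the question still has to be enforced on the results. One possible way is to post-filter the (index, length, needle, value) tuples that findAll yields. The sketch below is an illustration, not part of the original answer: `is_word_char`, `on_word_boundaries` and `filter_matches` are hypothetical helpers, "word character" is assumed to mean letters, digits and underscore, and `index` is assumed to be an offset into the haystack (findAll called with the default start=0).

```python
def is_word_char(ch):
    """Assumption: letters, digits and underscore count as word characters."""
    return ch.isalnum() or ch == "_"


def on_word_boundaries(haystack, index, length):
    """True if haystack[index:index+length] is not glued to an adjacent word."""
    end = index + length
    before_ok = index == 0 or not is_word_char(haystack[index - 1])
    after_ok = end == len(haystack) or not is_word_char(haystack[end])
    return before_ok and after_ok


def filter_matches(haystack, results):
    """Keep whole-word matches, and only the longest one at each start index."""
    best = {}
    for index, length, needle, value in results:
        if not on_word_boundaries(haystack, index, length):
            continue
        if index not in best or length > best[index][1]:
            best[index] = (index, length, needle, value)
    return [best[i] for i in sorted(best)]
```

With the class above, this would be called as `filter_matches(haystack, aho_tree.findAll(haystack))`. Matches that start at different positions but overlap (for example a key that is a suffix of another key) are not resolved by this sketch.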

