
Fastest way to find first match index of lots of strings in large text

I've implemented a find algorithm in Python 2.7.13. It does what I want, but there is a small performance issue. These are the characteristics of my problem:

  • My texts are HTML articles, usually between 5,000 and 50,000 characters, but they can reach 300,000 characters.
  • I have a list of "words" which can contain special characters (é, à, ø, / ...) and spaces, usually a few hundred to a few thousand words. Word length ranges from 2 to 256 characters.
  • I need to ignore matches that are contained inside HTML tags
  • I need the index of each match in the text
  • I only need the first match of each word

I have this implementation:

def find_indexes(text, words):
    words_indexes = []
    found_words = []
    authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']

    text_length = len(text)

    for j, word in enumerate(words):
        i = 0 

        # This loop moves on to the next occurrence of the word if the current one isn't valid (contained in another word or in an HTML tag)
        while i != -1: 
            i = text.find(word, i + 1)

            if i + 1 + len(word) < text_length:

                # We check the before and after character of the word because some words can be contained in others
                # Like "vision" is in "revision". As well as being contained in HTML tags
                before = text[i - 1]
                after = text[i + len(word)]
                if (before in authorized_characters and
                    after in authorized_characters and not
                    (before == u'.' and after == u'.')):
                    words_indexes.append(i)
                    found_words.append(word)

                    i = -1

    return words_indexes, found_words
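
For example, a minimal call might look like this (made-up text and word list, only to illustrate the expected behaviour):

# Hypothetical inputs, just to show how find_indexes is called
text = u'<p>His revision of the vision statement was short.</p>'
words = [u'vision', u'statement']

indexes, found = find_indexes(text, words)
print(zip(found, indexes))
# -> [(u'vision', 23), (u'statement', 30)]
# i.e. the standalone "vision" is matched, not the one inside "revision"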

With a large word list and a large text it starts to take some time (not a huge amount, but this is not the only processing I do, and since it is part of a Django view, any time improvement is welcome).

With this list of 1282 words and this long text of 231,884 characters (taken from a Waitbutwhy article and processed), I get an execution time of about 0.3 s on my computer.

But I feel there is a better way to do it, because the find() method takes up most of the computation time, as you can see in this line_profiler output:

Total time: 0.28045 s
Function: find_indexes at line 332

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   332                                           @line_profiler
   333                                           def find_indexes(text, words):
   334         1            4      4.0      0.0      words_indexes = []
   335         1            2      2.0      0.0      found_words = []
   336         1            2      2.0      0.0      authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']
   337                                           
   338         1            2      2.0      0.0      text_length = len(text)
   339                                           
   340      1283         4362      3.4      0.7      for j, word in enumerate(words):
   341      1282         1646      1.3      0.3          i = 0
   342                                           
   343      3436        11402      3.3      1.8          while i != -1:
   344      2154       543861    252.5     86.2              i = text.find(word, i + 1)
   345                                           
   346      2154        22153     10.3      3.5              if i + 1 + len(word) < text_length:
   347                                           
   348                                                           # We check the before and after character of the word because some words can be contained in others
   349                                                           # Like "vision" is in "revision". As well as being contained in HTML tags
   350      2154        16388      7.6      2.6                  before = text[i - 1]
   351      2154        19939      9.3      3.2                  after = text[i + len(word)]
   352      2154         7720      3.6      1.2                  if (before in authorized_characters and
   353       531         1468      2.8      0.2                      after in authorized_characters and not
   354       135          278      2.1      0.0                      (before == u'.' and after == u'.')):
   355       135          783      5.8      0.1                      words_indexes.append(i)
   356       135          428      3.2      0.1                      found_words.append(word)
   357                                           
   358       135          573      4.2      0.1                      i = -1
   359                                           
   360         1            2      2.0      0.0      return words_indexes, found_words

Here is an example that uses an HTML parser (so it filters the text elements out of the document and avoids matching text inside attributes/tags) together with a compiled regex (which scans for all the words in a single pass instead of looping N times over the text, which is your main bottleneck):

import ast
# regex (not the builtin one) and bs4 need to be pip installed 
import regex
from bs4 import BeautifulSoup

# Parse the document so we don't have to worry about HTML stuff
# and can find actual text content more easily
with open('text_to_find_the_words.txt') as fin:
    soup = BeautifulSoup(fin, 'html.parser')

# Get the words to look at and compile a regex to find them
# Might already be a list in memory instead of a file.
with open('list_of_words.txt') as fin:
    words = ast.literal_eval(fin.read())
    matching_words = regex.compile(r'\b(\L<words>)\b', words=words)

# For each matching text elements, do the highlighting
for match in soup.find_all(text=matching_words):
    subbed = matching_words.sub(r'<span style="background: yellow;">\1</span>', match)
    match.replace_with(BeautifulSoup(subbed, 'html.parser'))

# Write the results somewhere (probably to a HttpResponse object in your case)
with open('results.html', 'w') as fout:
    fout.write(str(soup))

You will need to adjust it if you only want to highlight the first occurrence of each word.
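
If you need the first match index of each word rather than a highlighted document, here is a minimal sketch along the same lines (same word list and the same compiled pattern as above; note that the indexes refer to the tag-free text returned by get_text(), not to the original HTML source):

import ast

# regex (not the builtin one) and bs4 need to be pip installed
import regex
from bs4 import BeautifulSoup

with open('text_to_find_the_words.txt') as fin:
    soup = BeautifulSoup(fin, 'html.parser')

with open('list_of_words.txt') as fin:
    words = ast.literal_eval(fin.read())

# One pattern for all the words, as above
matching_words = regex.compile(r'\b(\L<words>)\b', words=words)

# Text content only, so nothing inside tags/attributes can match
plain_text = soup.get_text()

first_index = {}
for m in matching_words.finditer(plain_text):
    # finditer goes left to right, so setdefault keeps only the first hit per word
    first_index.setdefault(m.group(1), m.start())

words_indexes = list(first_index.values())
found_words = list(first_index.keys())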

