Fastest way to find first match index of lots of strings in large text
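I have implemented a fast-find algorithm in Python 2.7.13. It does what I need, but it has a small performance problem. These are the characteristics of my algorithm:
I have this implementation: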
def find_indexes(text, words):
    words_indexes = []
    found_words = []
    authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']
    text_length = len(text)
    for j, word in enumerate(words):
        i = 0
        # This loop serves to go to the next word find if the first one isn't valid (contained in another word or in HTML tag)
        while i != -1:
            i = text.find(word, i + 1)
            if i + 1 + len(word) < text_length:
                # We check the before and after character of the word because some words can be contained in others
                # Like "vision" is in "revision". As well as being contained in HTML tags
                before = text[i - 1]
                after = text[i + len(word)]
                if (before in authorized_characters and
                        after in authorized_characters and not
                        (before == u'.' and after == u'.')):
                    words_indexes.append(i)
                    found_words.append(word)
                    i = -1
    return words_indexes, found_words
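As a side note on the boundary check: as I read the loop above, it starts searching at index 1, and it still runs the before/after check once find() has returned -1. The sketch below is one possible way to handle those edges for a single word. The helper name first_valid_index and the choice to treat the start and end of the text as valid boundaries are my assumptions, not part of the original code.

```python
# Same boundary characters as authorized_characters in the question,
# written with escapes so the file needs no encoding declaration.
AUTHORIZED = frozenset(u' .:;?!\u00bf\u00a1\u2026()')


def first_valid_index(text, word):
    """Return the index of the first standalone occurrence of word, or -1."""
    i = text.find(word)
    while i != -1:
        end = i + len(word)
        # Assumption: the start and end of the text count as valid boundaries.
        before = text[i - 1] if i > 0 else u' '
        after = text[end] if end < len(text) else u' '
        if (before in AUTHORIZED and after in AUTHORIZED
                and not (before == u'.' and after == u'.')):
            return i
        # Not a standalone match (e.g. "vision" inside "revision"): keep looking.
        i = text.find(word, i + 1)
    return -1
```

This does not change the cost profile: it still calls find() once per candidate occurrence of every word.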
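With a large word list and a large text, it starts to take some time. It is not huge, but this is not the only processing I do, since it is part of a Django view, so any improvement is welcome.
With these 1282 words and this 231884-character text (taken from a Waitbutwhy article and processed), it runs in about 0.3 s on my computer.
But I feel there is a better way, because the find() method takes most of the computation time, as you can see in the line_profiler output: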
Total time: 0.28045 s
Function: find_indexes at line 332

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   332                                           @line_profiler
   333                                           def find_indexes(text, words):
   334         1            4      4.0      0.0      words_indexes = []
   335         1            2      2.0      0.0      found_words = []
   336         1            2      2.0      0.0      authorized_characters = [u' ', u'.', u':', u';', u'?', u'!', u'¿', u'¡', u'…', u'(', u')']
   337
   338         1            2      2.0      0.0      text_length = len(text)
   339
   340      1283         4362      3.4      0.7      for j, word in enumerate(words):
   341      1282         1646      1.3      0.3          i = 0
   342
   343      3436        11402      3.3      1.8          while i != -1:
   344      2154       543861    252.5     86.2              i = text.find(word, i + 1)
   345
   346      2154        22153     10.3      3.5              if i + 1 + len(word) < text_length:
   347
   348                                                           # We check the before and after character of the word because some words can be contained in others
   349                                                           # Like "vision" is in "revision". As well as being contained in HTML tags
   350      2154        16388      7.6      2.6                  before = text[i - 1]
   351      2154        19939      9.3      3.2                  after = text[i + len(word)]
   352      2154         7720      3.6      1.2                  if (before in authorized_characters and
   353       531         1468      2.8      0.2                          after in authorized_characters and not
   354       135          278      2.1      0.0                          (before == u'.' and after == u'.')):
   355       135          783      5.8      0.1                      words_indexes.append(i)
   356       135          428      3.2      0.1                      found_words.append(word)
   357
   358       135          573      4.2      0.1                      i = -1
   359
   360         1            2      2.0      0.0      return words_indexes, found_words
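Here is an example that uses an HTML parser (so it picks the text elements out of the document and avoids matching text inside attributes or tags) and a compiled regular expression (which scans for all the words in one pass instead of looping over the text N times, which is your main bottleneck):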
import ast

# regex (not the builtin one) and bs4 need to be pip installed
import regex
from bs4 import BeautifulSoup

# Parse the document so we don't have to worry about HTML stuff
# and can find actual text content more easily
with open('text_to_find_the_words.txt') as fin:
    soup = BeautifulSoup(fin, 'html.parser')

# Get the words to look at and compile a regex to find them
# Might already be a list in memory instead of a file.
with open('list_of_words.txt') as fin:
    words = ast.literal_eval(fin.read())
matching_words = regex.compile(r'\b(\L<words>)\b', words=words)

# For each matching text element, do the highlighting
for match in soup.find_all(text=matching_words):
    subbed = matching_words.sub(r'<span style="background: yellow;">\1</span>', match)
    match.replace_with(BeautifulSoup(subbed, 'html.parser'))

# Write the results somewhere (probably to a HttpResponse object in your case)
with open('results.html', 'w') as fout:
    fout.write(str(soup))
You will need to adjust this if you want each word highlighted only once.
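If what you need is the index of the first match for each word rather than highlighting, the same single-pass idea can be sketched with the standard-library re module. This is a minimal sketch, not the code from the answer above: the function name find_first_indexes is hypothetical, it works on plain text (no HTML handling), and it uses \b word boundaries instead of the question's list of authorized characters, so results can differ for words that start or end with punctuation.

```python
import re


def find_first_indexes(text, words):
    """Map each word found in text to the index of its first whole-word match."""
    # Longest alternatives first, so a longer entry wins over a shorter one
    # that shares the same starting position.
    ordered = sorted(set(words), key=len, reverse=True)
    if not ordered:
        return {}
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(w) for w in ordered) + r')\b',
        re.UNICODE,
    )
    first_seen = {}
    # One scan over the text; only the first hit for each word is recorded.
    for match in pattern.finditer(text):
        first_seen.setdefault(match.group(0), match.start())
        if len(first_seen) == len(ordered):
            break  # every word has been located, no need to scan further
    return first_seen
```

For example, find_first_indexes(text, words) returns a dict in which words that never occur are simply absent. One limitation of this approach: matches do not overlap, so a word that only ever appears inside a longer matched entry (say "York" when "New York" is also in the list) will not be reported.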