Fast way to check for words in markdown?
I want to scan text for the presence of words from a list of words. This would be straightforward if the text were unformatted, but it is markdown-formatted. At the moment, I'm accomplishing this with regex:
import re

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
found_words = []
for word in words:
    word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
    match = word_pattern.search(text)
    if match:
        found_words.append(word)
I'm working with a very long list of words (a sort of denylist) and very large candidate texts, so speed is important to me. Is this a relatively efficient and speedy way to do this? Is there a better approach?
Have you considered stripping leading and trailing asterisks?
import re
from timeit import default_timer as timer

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']

def regexpCheck(words, text, n):
    found_words = []
    start = timer()
    for i in range(n):
        for word in words:
            word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
            match = word_pattern.search(text)
            if match:
                found_words.append(word)
    end = timer()
    return (end - start)

def stripCheck(words, text, n):
    found_words = []
    start = timer()
    for i in range(n):
        for word in text.split():
            candidate = word.strip('*')
            if candidate in words:
                found_words.append(candidate)
    end = timer()
    return (end - start)

n = 10000
print(stripCheck(words, text, n))
print(regexpCheck(words, text, n))
On my run, it's about an order of magnitude faster:
0.010649851000000002
0.12086547399999999
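
Note that stripCheck is not quite equivalent to the regex: it is case-sensitive, it only strips asterisks (so underscores and trailing punctuation stay attached), and `candidate in words` scans the whole list on every token. For a very long denylist, a set lookup should help. Here is a minimal sketch of that idea; the function name find_denied_words and the STRIP_CHARS constant are my own, and the set of characters to strip is an assumption you would adjust to your texts:

```python
# Characters treated as markdown emphasis or surrounding punctuation.
# This selection is an assumption; extend it for your own texts.
STRIP_CHARS = '*_.,!?;:'

def find_denied_words(text, words):
    """Return the set of denylist words that occur as whole tokens in text."""
    # Build the set once so each membership test is O(1) on average.
    denylist = {w.lower() for w in words}
    found = set()
    for token in text.split():
        # Drop emphasis markers and punctuation from both ends, ignore case.
        candidate = token.strip(STRIP_CHARS).lower()
        if candidate in denylist:
            found.add(candidate)
    return found

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
print(find_denied_words(text, words))  # {'markdown'}
```

This makes one pass over the text regardless of how many words are in the list. It only strips characters at the ends of whitespace-separated tokens, so a word embedded in something like a link (`[markdown](url)`) would not be detected; whether that matters depends on your data.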