Fast way to check for words in markdown?

Question

I want to scan text for the presence of words from a list of words. This would be straightforward if the text were unformatted, but it is markdown-formatted. At the moment, I'm accomplishing this with regex:

import re

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
found_words = []

for word in words:
    word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
    match = word_pattern.search(text)
    if match:
        found_words.append(word)

I'm working with a very long list of words (a sort of denylist) and very large candidate texts, so speed is important to me. Is this a relatively efficient and speedy way to do this? Is there a better approach?

Answer 1

Have you considered splitting the text on whitespace and stripping leading and trailing asterisks from each token?

import re

from timeit import default_timer as timer


text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']

def regexpCheck(words, text, n):
    found_words = []

    start = timer()
    for i in range(n):
        for word in words:
            word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
            match = word_pattern.search(text)
            if match:
                found_words.append(word)

    end = timer()
    return (end - start)


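def stripCheck(words, text, n):
    found_words = []

    start = timer()
    for i in range(n):
        # Split on whitespace, then peel markdown asterisks off each token.
        for word in text.split():
            candidate = word.strip('*')
            # Note: this is an exact, case-sensitive match against the list.
            if candidate in words:
                found_words.append(candidate)
    end = timer()

    return (end - start)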


n = 10000
print(stripCheck(words, text, n))
print(regexpCheck(words, text, n))

In my run, the strip-based version is about an order of magnitude faster (stripCheck is printed first, then regexpCheck):

0.010649851000000002
0.12086547399999999
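
One caveat: stripCheck only strips asterisks, is case-sensitive, and looks each token up in a list, so it does not match quite the same things as the original regex. A minimal sketch of a closer equivalent follows, assuming it is acceptable to strip every ASCII punctuation character (which covers *, _ and trailing . ! ?) from both ends of each token. The function name find_denylisted is hypothetical, and converting the word list to a set is my suggestion, not something benchmarked above.

```python
import string


def find_denylisted(words, text):
    """Return the set of denylist words that occur in text, ignoring case."""
    # A set gives constant-time average lookups; a list is scanned linearly,
    # which matters when the denylist is very long.
    denylist = {w.lower() for w in words}
    found = set()
    for token in text.split():
        # Strip markdown markers and punctuation from both ends of the token.
        candidate = token.strip(string.punctuation).lower()
        if candidate in denylist:
            found.add(candidate)
    return found


text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
print(find_denylisted(words, text))  # {'markdown'}
```

This still treats each whitespace-separated token as a unit, so a word inside link syntax such as [markdown](url), or joined to another word by a hyphen, would not be found. If those cases matter, the text probably needs real markdown parsing or a regex-based tokenizer instead.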

Answer 1: accepted, 2019-07-17 16:29:06