Fast way to check for words in markdown?

I want to scan text for the presence of words from a list of words. This would be straightforward if the text were unformatted, but it is markdown-formatted. At the moment, I'm accomplishing this with regex:

import re

text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
found_words = []

for word in words:
    word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
    match = word_pattern.search(text)
    if match:
        found_words.append(word)

I'm working with a very long list of words (a sort of denylist) and very large candidate texts, so speed is important to me. Is this a relatively efficient and speedy way to do this? Is there a better approach?
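For reference, one variation on the loop above is to escape every word and compile a single alternation once, so the text is scanned one time rather than once per word. This is only a sketch under assumptions: the function name `find_words_single_pattern` is hypothetical, it returns a set of lowercased hits rather than a list, and it keeps the same delimiter characters as the original pattern.

```python
import re

def find_words_single_pattern(words, text):
    """Return the set of listed words found in text, using one compiled regex."""
    if not words:
        return set()
    # Escape each word so regex metacharacters in the list are matched literally.
    # Longer words go first so the alternation tries the longest candidate first.
    alternation = '|'.join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    # Same delimiters as the original pattern; the trailing delimiter is a
    # lookahead so it stays available as the leading delimiter of the next word.
    pattern = re.compile(
        r'(?:^|[ *_])(' + alternation + r')(?=$|[ *_.!?])',
        re.I | re.M,
    )
    return {m.group(1).lower() for m in pattern.finditer(text)}


text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
print(find_words_single_pattern(words, text))
```

Whether this is actually faster for a very long word list would need to be measured on real data; a huge alternation can be slow in its own right.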

Have you considered splitting the text on whitespace and stripping leading and trailing asterisks from each token?

import re

from timeit import default_timer as timer


text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']

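def regexpCheck(words, text, n):
    # Time n passes of the question's approach: one regex search per word.
    found_words = []

    start = timer()
    for i in range(n):
        for word in words:
            word_pattern = re.compile(r'(^|[ \*_])' + word + r'($|[ \*_.!?])', (re.I | re.M))
            match = word_pattern.search(text)
            if match:
                found_words.append(word)

    end = timer()
    return (end - start)


def stripCheck(words, text, n):
    # Time n passes of the alternative: split on whitespace, strip surrounding
    # asterisks, then test list membership. Unlike the regex, this is
    # case-sensitive and leaves underscores and trailing punctuation in place.
    found_words = []

    start = timer()
    for i in range(n):
        for word in text.split():
            candidate = word.strip('*')
            if candidate in words:
                found_words.append(candidate)
    end = timer()

    return (end - start)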


n = 10000
print(stripCheck(words, text, n))
print(regexpCheck(words, text, n))

In my run, the strip-based version is about an order of magnitude faster (first line is stripCheck, second is regexpCheck, both in seconds):

0.010649851000000002
0.12086547399999999
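Since the question mentions a very long denylist, the stripping idea might be pushed a little further by storing the words in a set, so each lookup is a hash check instead of a list scan. The sketch below is an assumption-laden variant, not the benchmarked code: the name `find_words_by_stripping` is hypothetical, it also strips underscores and the punctuation `.!?` from both ends of each token, and it lowercases everything to mimic the case-insensitive regex.

```python
def find_words_by_stripping(words, text):
    """Return the set of listed words that appear as tokens in text."""
    # Lowercased set: membership tests do not depend on the list length.
    denylist = {w.lower() for w in words}
    found = set()
    for token in text.split():
        # Strip markdown emphasis markers and simple punctuation from both ends.
        # Simplification: link syntax such as [word](url) is not handled.
        candidate = token.strip('*_.!?').lower()
        if candidate in denylist:
            found.add(candidate)
    return found


text = 'A long text string with **markdown** formatting.'
words = ['markdown', 'markup', 'marksideways']
print(find_words_by_stripping(words, text))
```

This only matches single-token entries; multi-word phrases in the list would need a different strategy.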
