Find regex match in large dictionary using Python quickly

I have a large dictionary with regex patterns as keys and numeric values as values. Given a corpus (broken down into a list of individual word tokens), I would like to find the regex that best matches each word in order to obtain its respective value.

The dictionary contains many regex keys that are ambiguous, in the sense that a word may match multiple regexes, so you want to find the longest regex, or 'best match' (e.g. the dictionary contains affect+ as well as affected and affection).

My issue is that when I run a large text sample through the dictionary and find the regex match for each word token, it takes a long time (about 0.1 s per word), which adds up quickly over thousands of words. This is because it goes through the whole dictionary each time to find the 'best match'.

Is there a faster way to achieve this? Please see the problematic part of my code below.

import re

for word in textTokens:
    # Every token is tested against every pattern: O(tokens * patterns)
    for reg, value in dictionary.items():
        if re.match(reg, word):
            matchedWords.append(reg)
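
For reference, the 'best match' rule described above (keep the longest regex that matches) could be sketched as follows. This is only a baseline under stated assumptions: `best_match`, `patterns`, and the sample `dictionary` are hypothetical names and data, not taken from the original code. Precompiling the patterns avoids repeated compilation, but each lookup still scans the whole dictionary.

```python
import re


def best_match(word, patterns):
    """Return (pattern, value) for the longest pattern string that matches
    the start of `word`, or None if nothing matches."""
    best = None
    for pattern, (compiled, value) in patterns.items():
        # re.match anchors at the start of the word only, as in the question.
        if compiled.match(word) and (best is None or len(pattern) > len(best[0])):
            best = (pattern, value)
    return best


# Hypothetical sample data, for illustration only.
dictionary = {"affect+": 1, "affected": 2, "affection": 3}

# Compile each pattern once instead of once per word.
patterns = {p: (re.compile(p), v) for p, v in dictionary.items()}

text_tokens = ["affected", "affection", "affects", "happy"]
matches = {word: best_match(word, patterns) for word in text_tokens}
```

The cost per token is still proportional to the number of dictionary keys, which is the bottleneck the question is about.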

Because you mentioned that the input regexes have the structure word+ (a simple word followed by a plus regex symbol), you can use a modified version of the Aho-Corasick algorithm. In this algorithm you build a finite-state machine from the search patterns, and it can be modified fairly easily to accept some regex symbols. In your case, a very simplistic solution would be to pad your keys to the length of the longest word in the list and accept anything that comes after the padding. The wildcards '.' and '?' are easy to implement. For '*' one has to either go to the end of the word or return and follow the other path through the list, which can be an exponential number of choices in constant memory (as with any deterministic finite automaton).

A finite-state machine for your list of regex keys can be built in linear time, meaning the time it takes is proportional to the sum of the lengths of your dictionary keys. As explained here, there is also support for finding the longest matching word in the dictionary.
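
A minimal sketch of this idea, using a plain trie rather than a full Aho-Corasick automaton. It assumes that every key is a literal word, optionally followed by a trailing "+" that is treated as "this stem plus any suffix", and that keys without it must match the whole word. That is an interpretation of the question's patterns, not strict regex semantics, and the names `TrieNode`, `build_trie`, and `longest_match` are hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "exact", "prefix")

    def __init__(self):
        self.children = {}
        self.exact = None   # (key, value) for a key that must match the whole word
        self.prefix = None  # (key, value) for a key ending in the wildcard


def build_trie(dictionary, wildcard="+"):
    """Build the trie in time proportional to the total length of the keys."""
    root = TrieNode()
    for key, value in dictionary.items():
        is_prefix = key.endswith(wildcard)
        stem = key[:-len(wildcard)] if is_prefix else key
        node = root
        for ch in stem:
            node = node.children.setdefault(ch, TrieNode())
        if is_prefix:
            node.prefix = (key, value)
        else:
            node.exact = (key, value)
    return root


def longest_match(root, word):
    """Return the (key, value) of the longest key matching `word`, or None."""
    best = root.prefix  # a bare wildcard key would match every word
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return best  # fell off the trie: keep the deepest prefix key seen
        if node.prefix is not None:
            best = node.prefix
    # Whole word consumed: an exact key here is at least as long as any prefix key.
    return node.exact if node.exact is not None else best


# Hypothetical sample data, for illustration only.
trie = build_trie({"affect+": 1, "affected": 2, "affection": 3})
results = {w: longest_match(trie, w) for w in ["affected", "affects", "affectionate", "happy"]}
```

Each lookup walks at most one node per character of the word, so its cost depends on the word length and not on the number of dictionary keys.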

Thank you everyone for your answers; they were all very interesting and helpful. Ultimately I was forced to hand in my project without implementing a clean solution due to time constraints, but I have finally been able to go back and implement the Trie structure recommended by Stef. With a trie implemented, I was able to increase the speed of my algorithm by over 30x.

For anyone interested in seeing my final project, the application can be found here, and the code is on my Github.
