Fastest way to match substring from large dict

I have a string (usually under 300 characters long) such as 'aabbccdcabcbbacdaaa'.

There is a Python dictionary whose keys are strings in a similar format, e.g. 'bcccd'; key length varies from 10 to 100 characters. The dictionary has half a million items.

I need to find the dictionary value whose key matches my initial string, or find out that the dictionary has no matching entry. Matching condition: the dictionary key must occur somewhere within the string (exact matching).

What is the best way to do this in terms of computational speed? I feel there should be some clever way to hash my initial string and the dictionary keys so that an efficient substring search (like Rabin-Karp or Knuth-Morris-Pratt) can be applied. Or could a suffix-tree-like structure be a good solution?

Just found a reasonable implementation of Aho-Corasick for Python: pyahocorasick. Taking from the example at the end of its page:

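import ahocorasick
A = ahocorasick.Automaton()

# Register every dictionary key, with its value as the payload
# (.items() on Python 3; the original used Python 2's .iteritems())
for k, v in your_big_dict.items():
    A.add_word(k, v)

A.make_automaton()
# Each item is an (end_index, value) pair for a key found in the string
for item in A.iter(your_long_string):
    print(item)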
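
To show what such an automaton does internally, here is a minimal pure-Python sketch of the Aho-Corasick idea: a trie of all keys plus failure links, scanned once over the string. It is an illustration only, not pyahocorasick's implementation; the names `build_automaton` and `find_matches` are made up for this example, and for half a million keys the C extension above is very likely the better choice for both memory and speed.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie over `patterns` and add Aho-Corasick failure links."""
    goto = [{}]   # goto[state][char] -> next state (state 0 is the root)
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end when this state is reached

    # Phase 1: insert every pattern into the trie.
    for key in patterns:
        state = 0
        for ch in key:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(key)

    # Phase 2: breadth-first pass to compute failure links.
    # Depth-1 states keep fail == 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Patterns ending at the suffix state also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Yield (start_index, pattern) for every pattern occurrence in `text`."""
    goto, fail, out = automaton
    state = 0
    for end, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for key in out[state]:
            yield end - len(key) + 1, key


# Hypothetical sample data, reusing the strings from this page.
sample = {'aaa': 1, 'acd': 2, 'adb': 3, 'bccd': 4, 'cbbb': 5, 'abc': 6}
automaton = build_automaton(sample)
print(list(find_matches('aabbccdcabcbbacdaaa', automaton)))
# [(3, 'bccd'), (8, 'abc'), (13, 'acd'), (16, 'aaa')]
```

The automaton is built once from the dictionary; after that, each string is scanned in a single left-to-right pass, which is what makes this approach attractive when many strings are checked against the same large dictionary.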

You can use the following approach, which checks every dictionary key against the string:

for key in your_dictionary:
    if key in your_string:
        print(key+' is in both your string and the dictionary. It has the value '+str(your_dictionary[key]))

If you want this changed in any way, let me know in the comments; I'll be happy to update.

def search(string, dict_search):
    # If those 2 lines are too expensive, calculate them and pass as arguments
    max_key = max(len(x) for x in dict_search)
    min_key = min(len(x) for x in dict_search)

    return set(
        string[x:x+i] 
        for i in range(min_key, max_key+1)
        for x in range(len(string)-i+1)
        if string[x:x+i] in dict_search
    )

Running:

>>> search('aabbccdcabcbbacdaaa', {'aaa', 'acd', 'adb', 'bccd', 'cbbb', 'abc'})
{'aaa', 'abc', 'acd', 'bccd'}
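
Since the question asks for the dictionary values, a possible variation of the function above returns a mapping from each matched key to its value and only tries substring lengths that actually occur among the keys. This is a sketch under those assumptions; the name `search_values` and the sample values are made up for illustration.

```python
def search_values(string, dict_search):
    """Return {key: value} for every dictionary key that occurs in `string`."""
    # Only lengths that some key actually has are worth trying.
    lengths = sorted({len(key) for key in dict_search})
    found = {}
    for size in lengths:
        # range() is empty when the string is shorter than `size`.
        for start in range(len(string) - size + 1):
            candidate = string[start:start + size]
            if candidate in dict_search:
                found[candidate] = dict_search[candidate]
    return found


# Hypothetical sample values attached to the keys used above.
sample = {'aaa': 1, 'acd': 2, 'adb': 3, 'bccd': 4, 'cbbb': 5, 'abc': 6}
print(search_values('aabbccdcabcbbacdaaa', sample))
```

With a 300-character string and every key length from 10 to 100 present, this performs at most 22,386 hash lookups (the sum of 301 - i for i from 10 to 100), compared with half a million `in` checks when looping over the dictionary instead. Unlike the version using `max` and `min`, it also returns an empty result for an empty dictionary rather than raising `ValueError`.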
