Fastest way to match substring from large dict
I have a string (usually under 300 characters long) such as 'aabbccdcabcbbacdaaa'.
There is a Python dictionary whose keys are strings in a similar format, e.g. 'bcccd'; key length varies from 10 to 100 symbols. The dictionary has half a million items.
I need to match my initial string against the dictionary and get the corresponding value, or find out that the dictionary has no suitable entry. Matching condition: the dictionary key must occur somewhere within the string (strict matching).
What is the best way to do this in terms of computational speed? I feel there should be some clever way to hash my initial string and the dictionary keys so that an efficient substring search (like Rabin-Karp or Knuth-Morris-Pratt) can be applied. Or could a suffix-tree-like structure be a good solution?
Just found a reasonable implementation of Aho-Corasick for Python: pyahocorasick. Adapted from the example at the end of its page:
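import ahocorasick

A = ahocorasick.Automaton()
# Register every dictionary key, storing its value as the payload.
for k, v in your_big_dict.items():  # use .iteritems() on Python 2
    A.add_word(k, v)
A.make_automaton()

# Each item is (end_index, value) for a key found in the string.
for item in A.iter(your_long_string):
    print(item)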
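To make it clearer what such an automaton does, here is a minimal pure-Python sketch of the Aho-Corasick idea. It is not pyahocorasick's implementation, and the names `build_automaton` and `find_all` are made up for this illustration. It builds a trie of the keys, adds failure links with a breadth-first pass, and then scans the string once.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and outputs (illustrative only)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)

    # Breadth-first pass: a node's failure link points to the longest
    # proper suffix of its path that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in scan order."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits


keys = ['aaa', 'acd', 'adb', 'bccd', 'cbbb', 'abc']
print(find_all('aabbccdcabcbbacdaaa', build_automaton(keys)))
```

The scan visits each character of the string once, so the cost of a lookup depends on the string length and the number of matches rather than on the number of dictionary keys. For half a million keys the C-backed library is likely the more practical choice; the sketch is only meant to show the mechanism.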
You can use a plain loop over the dictionary keys:
for key in your_dictionary:
    if key in your_string:
        print(key + ' is in both your string and the dictionary. It has the value ' + str(your_dictionary[key]))
If you want this changed in any way, let me know in the comments and I'll be happy to update.
def search(string, dict_search):
    # If those 2 lines are too expensive, calculate them and pass as arguments
    max_key = max(len(x) for x in dict_search)
    min_key = min(len(x) for x in dict_search)
    return set(
        string[x:x+i]
        for i in range(min_key, max_key+1)
        for x in range(len(string)-i+1)
        if string[x:x+i] in dict_search
    )
Running:
>>> search('aabbccdcabcbbacdaaa', {'aaa', 'acd', 'adb', 'bccd', 'cbbb', 'abc'})
{'aaa', 'abc', 'acd', 'bccd'}
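A possible variation on the `search` function above, as a sketch only: if the key lengths are sparse, it may help to try only the lengths that actually occur among the keys, and to return the matching values along with the keys. The name `search_values` is hypothetical and not part of the original answer.

```python
def search_values(string, dict_search):
    """Return {key: value} for every dictionary key found inside string."""
    # Only substring lengths that some key actually has are worth trying.
    lengths = sorted({len(k) for k in dict_search})
    found = {}
    for size in lengths:
        # range() is empty when size exceeds len(string), so long keys are skipped.
        for start in range(len(string) - size + 1):
            piece = string[start:start + size]
            if piece in dict_search:
                found[piece] = dict_search[piece]
    return found


sample = {'aaa': 1, 'acd': 2, 'adb': 3, 'bccd': 4, 'cbbb': 5, 'abc': 6}
print(search_values('aabbccdcabcbbacdaaa', sample))
```

As with `search`, the work is proportional to the string length times the number of lengths tried, and each membership test is a hash lookup, so the size of the dictionary matters little once it is built.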