How to speed up string matching against a list of strings?
I have a list of strings, and I am trying to find out whether any of them appear in an English dictionary that is stored as another list.
The time it takes to find a match grows roughly linearly with the number of strings, but it becomes far too long once the original list has a few thousand strings.
On my development EC2 instance, it takes ~2 seconds for 100 strings, ~15 seconds for 700 strings, ~100 seconds for 5000 strings, and ~800 seconds for 40000 strings!
Is there a way to speed this up? Thanks in advance.
matching_word = ""
for w in all_strings:
    if w in english_dict:
        if matching_word:  # More than one possible word
            matching_word = matching_word + ", " + w
        else:
            matching_word = w
Instead of building a string and extending it, you can use a list comprehension:
matching_words = [x for x in all_strings if x in english_dict]
Now you can make a string from that list using ", ".join(matching_words).
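Putting the two steps together, a minimal sketch might look like the following. The sample contents of all_strings and english_dict are made up for illustration. Note that as long as english_dict remains a list, each membership test is still a linear scan, so this mainly tidies the code and does not by itself change how the running time scales.

```python
# Hypothetical sample data; the real lists come from the question.
all_strings = ["apple", "xqzt", "banana", "qwrty"]
english_dict = ["apple", "banana", "cherry"]

# Collect every string that is also in the dictionary, keeping input order.
matching_words = [w for w in all_strings if w in english_dict]

# Join once at the end instead of concatenating inside the loop.
matching_word = ", ".join(matching_words)
print(matching_word)
```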
Another option: convert both lists to sets and use the & operator:
set(all_strings) & set(english_dict)
The result is a set containing the items that appear in both lists.
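As a rough illustration of the intersection approach (the sample data is invented), keep in mind that a set drops duplicates and does not preserve the order of all_strings, so the sketch sorts the result before joining to get a stable string:

```python
# Hypothetical sample data; "banana" appears twice on purpose.
all_strings = ["banana", "xqzt", "apple", "banana"]
english_dict = ["apple", "banana", "cherry"]

# Items present in both collections; duplicates collapse to one entry.
common = set(all_strings) & set(english_dict)

# Sets are unordered, so sort for a deterministic result.
matching_word = ", ".join(sorted(common))
print(matching_word)
```

If the original order or repeated matches matter, the list comprehension form is probably the better fit.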
Provided you don't have issues with memory, turn your english_dict into a set before the loop (if you do have memory issues, load your dictionary as a set to begin with):
english_dict = set(english_dict)
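That should significantly speed up the look-up, because membership tests on a set are hash-based and take roughly constant time on average, whereas a membership test on a list compares against the elements one by one. If that's not enough, you'll have to resort to creating search trees and similar search optimizations.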
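To see the difference on your own machine, a small timing sketch along these lines may help. The word lists are synthetic stand-ins, the sizes are arbitrary, and find_matches is a hypothetical helper name; actual timings will vary with hardware and data.

```python
import time

def find_matches(candidates, dictionary):
    """Return the candidates found in dictionary, preserving input order."""
    return [w for w in candidates if w in dictionary]

# Synthetic stand-ins for the real data (hypothetical sizes and contents).
english_list = [f"word{i}" for i in range(20_000)]
all_strings = [f"word{i}" for i in range(0, 40_000, 200)]  # 200 strings, half present

# Build the set once, before any look-ups.
english_set = set(english_list)

start = time.perf_counter()
from_list = find_matches(all_strings, english_list)  # linear scan per look-up
list_seconds = time.perf_counter() - start

start = time.perf_counter()
from_set = find_matches(all_strings, english_set)  # hash look-up per string
set_seconds = time.perf_counter() - start

# Both approaches must agree on the result.
assert from_list == from_set
print(f"list: {list_seconds:.4f}s, set: {set_seconds:.6f}s")
```

The one-time cost of building the set is paid before the loop, which is why the conversion should not be repeated inside it.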