
Sets vs. Regex for string lookup, which is more scalable?

Suppose that I need to handle a very big list of words, and I need to count the number of times I find any of those words in a piece of text I have. Which is the best option in terms of scalability?

Option I (regex)

>>> import re
>>> s = re.compile("|".join(big_list))
>>> len(s.findall(sentence))

Option II (sets)

>>> s = set(big_list)
>>> len([word for word in sentence.split(" ") if word in s]) # O(1) avg lookup time

Example: if the list is ["cat","dog","knee"] and the text is "the dog jumped over the cat, but the dog broke his knee", the final result should be: 4
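Note that on this exact example the two options above already disagree: split(" ") leaves the comma attached to "cat,", so the set version counts 3 while the regex counts 4. A quick check:

import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

pattern = re.compile("|".join(big_list))
print(len(pattern.findall(sentence)))  # 4 -- matches "cat" inside "cat,"

s = set(big_list)
print(len([w for w in sentence.split(" ") if w in s]))  # 3 -- "cat," is not in the set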

PS: Any other option is welcome.

If your words are alphanumeric, I might use something like:

import re

s = set(big_list)
# tokenize into whole words, then test each token against the set
sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s)

Since the membership test for a set is on average O(1), this algorithm becomes O(N+M) where N is the number of words in the sentence and M is the number of elements in big_list. Not too shabby. It also does pretty well in terms of memory usage.
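As a minimal check (reusing the question's example data), this tokenization picks up the "cat," token that a plain split(" ") misses:

import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

s = set(big_list)
# \b\w+\b extracts whole words, so the comma after "cat" is ignored
print(sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s))  # 4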

A scalable method would be sorting the input dictionary and the words from the text, then doing the matching using two iterators. You can also use a trie for even better performance. I don't know the internal representation of the set; however, using a large regex would be total overkill.
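A minimal sketch of the two-iterator idea, assuming both sides fit in memory and using naive whitespace tokenization (swap in \b\w+\b tokenization as above if punctuation matters):

def count_matches(big_list, sentence):
    # Sort the deduplicated dictionary once and the text tokens once,
    # then walk both lists in a single merge-style pass: O(M log M + N log N).
    dictionary = sorted(set(big_list))
    words = sorted(sentence.split())
    count = 0
    i = j = 0
    while i < len(dictionary) and j < len(words):
        if dictionary[i] == words[j]:
            count += 1
            j += 1  # advance only the text side so repeated words are all counted
        elif dictionary[i] < words[j]:
            i += 1
        else:
            j += 1
    return count

print(count_matches(["cat", "dog", "knee"],
                    "the dog jumped over the cat, but the dog broke his knee"))  # 3 -- naive split keeps "cat,"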
