
Sets vs. Regex for string lookup, which is more scalable?

Suppose that I need to handle a very big list of words, and I need to count the number of times I find any of those words in a piece of text I have. Which is the best option in terms of scalability?

Option I (regex)

>>> import re
>>> s = re.compile("|".join(big_list))
>>> len(s.findall(sentence))

Option II (sets)

>>> s = set(big_list)
>>> len([word for word in sentence.split(" ") if word in s]) # O(1) avg lookup time

Example: if the list is ["cat","dog","knee"] and the text is "the dog jumped over the cat, but the dog broke his knee", the final result should be: 4
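Note that on this exact example the two options above already disagree: split(" ") leaves the comma attached to "cat,", so the set version counts 3 while the regex counts 4. A quick check:

import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

pattern = re.compile("|".join(big_list))
print(len(pattern.findall(sentence)))  # 4 -- matches "cat" inside "cat,"

s = set(big_list)
print(len([w for w in sentence.split(" ") if w in s]))  # 3 -- "cat," is not in the set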

PS: Any other option is welcome.

If your words are alphanumeric, I might use something like:

import re

s = set(big_list)
# tokenize into whole words, then test each token against the set
sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s)

Since the membership test for a set is on average O(1), this algorithm becomes O(N+M) where N is the number of words in the sentence and M is the number of elements in big_list. Not too shabby. It also does pretty well in terms of memory usage.
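As a minimal check (reusing the question's example data), this tokenization picks up the "cat," token that a plain split(" ") misses:

import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

s = set(big_list)
# \b\w+\b extracts whole words, so the comma after "cat" is ignored
print(sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s))  # 4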

A scalable method would be sorting the input dictionary and the words from the text, then doing the matching using two iterators. You can also use a trie for even better performance. I don't know the internal representation of the set; however, using a large regex would be total overkill.
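A minimal sketch of the two-iterator idea, assuming both sides fit in memory and using naive whitespace tokenization (swap in \b\w+\b tokenization as above if punctuation matters):

def count_matches(big_list, sentence):
    # Sort the deduplicated dictionary once and the text tokens once,
    # then walk both lists in a single merge-style pass: O(M log M + N log N).
    dictionary = sorted(set(big_list))
    words = sorted(sentence.split())
    count = 0
    i = j = 0
    while i < len(dictionary) and j < len(words):
        if dictionary[i] == words[j]:
            count += 1
            j += 1  # advance only the text side so repeated words are all counted
        elif dictionary[i] < words[j]:
            i += 1
        else:
            j += 1
    return count

print(count_matches(["cat", "dog", "knee"],
                    "the dog jumped over the cat, but the dog broke his knee"))  # 3 -- naive split keeps "cat,"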
