Python: checking a bag of words with `in` on a set
I have a text processing script that needs to check whether a word is in a bag of words, and it has to do this many (tens of thousands of) times. I would think the most efficient way to do this would be to define the set of words I am checking against:

wordBag = set(['these', 'are', 'my', 'words'])

and then test membership with:

if word in wordBag:

I looked at the documentation, and this is average case O(1) and worst case O(n). Is the worst case just due to chaining in hash sets? Is there a more pythonic way of doing this?
It really depends on the size of your set of words. For any non-huge amount (probably up to tens of thousands of words) your approach is absolutely fine and totally pythonic. Simplicity is a virtue!
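A minimal sketch of that approach (the `word_bag` contents and sample sentence are placeholders): build the set once up front, then every membership check inside the loop is an average-case O(1) hash lookup.

```python
# Build the set once; repeated 'in' checks against it are O(1) on average.
word_bag = {'these', 'are', 'my', 'words'}  # set literal, same as set([...])

tokens = "are these words in my bag of words".split()
matches = [tok for tok in tokens if tok in word_bag]
print(matches)  # ['are', 'these', 'words', 'my', 'words']
```
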
If you do have a vast number of words in your bag, a prefix-tree (or "trie") approach will likely be best; you can check PyPI for existing implementations.
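To illustrate the idea behind the trie suggestion, here is a minimal sketch using nested dicts. This is an assumption about how one might hand-roll it; the PyPI packages the answer alludes to are far more memory-efficient.

```python
# Minimal dict-of-dicts trie sketch (illustrative only, not a PyPI package).
_END = object()  # sentinel key marking the end of a complete word

def make_trie(words):
    """Build a trie: each node is a dict mapping a character to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True  # mark that a complete word ends here
    return root

def in_trie(trie, word):
    """Return True only if 'word' was inserted as a complete word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return _END in node

trie = make_trie(['these', 'are', 'my', 'words'])
print(in_trie(trie, 'these'))  # True
print(in_trie(trie, 'the'))    # False: 'the' is only a prefix, not a stored word
```

The `_END` sentinel distinguishes complete words from mere prefixes, which is why the lookup for `'the'` fails even though `'these'` is stored.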
EDIT: The O(n) worst case only occurs with huge numbers of hash collisions, as you suspected, but that is not usually a problem in practice. In any case, I would go with the simplest approach first and only look at more advanced solutions if you actually run into performance or memory issues.
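If you do want to measure before optimizing, a quick hedged sketch with `timeit` (the word count and repetition count here are arbitrary) shows why the set is the right default: a list membership test scans linearly, while the set does a hash lookup.

```python
import timeit

words = [f"word{i}" for i in range(10_000)]
word_list = list(words)
word_set = set(words)

# Look up a word near the end of the list, the worst case for a linear scan.
target = "word9999"

list_time = timeit.timeit(lambda: target in word_list, number=1_000)
set_time = timeit.timeit(lambda: target in word_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

On any normal interpreter the set lookup is orders of magnitude faster for a bag of this size.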