Python in Set Checking Bag of Words

Original post 2015-02-27 18:57:28 4 1 python / performance

I have a text processing script that needs to check whether a word is in a bag of words, and it has to do this many times (tens of thousands). I would think the most efficient way is to define the set of words I am checking against, wordBag = set(['these', 'are', 'my', 'words']), and then test if word in wordBag: . I looked at the documentation, and this lookup is O(1) in the average case and O(n) in the worst case. Is the worst case just due to chaining in hash sets? Is there a more Pythonic way of doing this?

1 Answer

It really depends on the size of your set of words. For any non-huge number of words (probably up to tens of thousands), your approach is absolutely fine and totally Pythonic. Simplicity is a virtue!
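
For reference, the set-based check can be sketched as follows. The names WORD_BAG, count_hits, and the sample sentence are made up for illustration; the only thing taken from the question is the `word in wordBag` membership test.

```python
# Hypothetical bag of words; a frozenset signals that it is never modified.
WORD_BAG = frozenset(['these', 'are', 'my', 'words'])


def count_hits(tokens, bag=WORD_BAG):
    """Count how many tokens appear in the bag (one hash lookup per token)."""
    return sum(1 for token in tokens if token in bag)


# Hypothetical input text, split on whitespace.
tokens = 'these are not all my words'.split()
print(count_hits(tokens))
```

Building the set once, outside any loop, is the main thing to get right; each membership test afterwards is a single hash lookup on average.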

If you do have a vast number of words in your bag, a prefix-tree (or "trie") approach will likely be best; you can check PyPI for existing implementations (example).
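
To show what a trie lookup involves, here is a minimal sketch built from nested dicts. It is not the API of any PyPI package; build_trie, in_trie, and the _END sentinel are hypothetical names. A pure-Python trie like this will likely be slower and heavier than a plain set, so treat it as an illustration of the data structure rather than an optimization.

```python
_END = object()  # sentinel key marking the end of a complete word


def build_trie(words):
    """Build a nested-dict trie: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def in_trie(trie, word):
    """Return True only if the whole word was inserted (prefixes do not count)."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return _END in node


trie = build_trie(['these', 'are', 'my', 'words'])
print(in_trie(trie, 'these'), in_trie(trie, 'the'))
```

Words that share a prefix share the dicts for that prefix, which is where a compiled trie implementation can save memory on very large vocabularies.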

EDIT: The O(n) worst case only occurs with huge numbers of hash collisions, as you suspected, but that's not usually a problem in practice. In any case, I would go with the simplest approach first and only look at more advanced solutions if you actually have performance or memory issues.
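
The collision worst case can be provoked on purpose with a key type whose hash is constant. The sketch below is a contrived demonstration, not something ordinary strings will do; CountingKey and comparisons_for_miss are hypothetical names. It counts equality comparisons instead of timing anything, so the result does not depend on the machine.

```python
class CountingKey:
    """Hypothetical key type that counts equality comparisons."""
    eq_calls = 0  # shared counter, reset before each measurement

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # A constant hash forces every key to collide with every other key.
        return 0 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value


def comparisons_for_miss(n, constant_hash):
    """Build a set of n keys, then count __eq__ calls for one failed lookup."""
    bag = {CountingKey(i, constant_hash) for i in range(n)}
    CountingKey.eq_calls = 0
    found = CountingKey(-5, constant_hash) in bag
    assert not found
    return CountingKey.eq_calls


good = comparisons_for_miss(500, constant_hash=False)
bad = comparisons_for_miss(500, constant_hash=True)
print(good, bad)
```

With well-distributed hashes the failed lookup needs essentially no equality comparisons, while with a constant hash it has to compare against every stored key, which is the O(n) behaviour mentioned in the documentation.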