
Python in Set Checking Bag of Words

I have a text-processing script that needs to check whether a word is in a bag of words, and it has to do this tens of thousands of times. I would think the most efficient way is to define the set of words I am checking against, wordBag = set(['these', 'are', 'my', 'words']), and then test if word in wordBag:. According to the documentation, this lookup is O(1) in the average case and O(n) in the worst case. Is the worst case just due to chaining in hash sets? Is there a more Pythonic way of doing this?

It really depends on the size of your set of words. For any non-huge number of words (probably up to tens of thousands), your approach is absolutely fine and totally Pythonic. Simplicity is a virtue!
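For reference, here is a minimal sketch of the set-based approach from the question. The names WORD_BAG, count_hits, and the sample sentence are made up for illustration; a frozenset is used only because the bag is never modified, and a plain set would behave the same way for lookups.

```python
# Hypothetical bag of words, taken from the question's example.
WORD_BAG = frozenset(["these", "are", "my", "words"])


def count_hits(tokens, word_bag=WORD_BAG):
    """Return how many tokens appear in word_bag.

    Each `token in word_bag` test is a hash lookup, so the total cost
    grows with the number of tokens, not with the size of the bag.
    """
    return sum(1 for token in tokens if token in word_bag)


# Illustrative input: four of these six tokens are in the bag.
tokens = "these are not all my words".split()
hits = count_hits(tokens)
print(hits)
```

Building the set once, outside the loop, is the part that matters; rebuilding it on every check would throw away the benefit.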

If you do have a vast number of words in your bag, a prefix-tree (or "trie") approach will likely be best; you can check PyPI for existing implementations.
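To make the idea concrete, here is a rough sketch of a trie built from nested dicts. This is not one of the PyPI implementations mentioned above; the class name Trie and its methods are assumptions for illustration. In pure Python, a structure like this is unlikely to beat the built-in set for exact membership tests, so treat it as a picture of the data structure rather than a drop-in optimization.

```python
class Trie:
    """Minimal prefix tree built from nested dicts (illustrative only)."""

    _END = object()  # sentinel key marking the end of a complete word

    def __init__(self, words=()):
        self._root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert a word, creating one dict level per character."""
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        """Support `word in trie` for complete words only."""
        node = self._root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node

    def has_prefix(self, prefix):
        """Return True if any stored word starts with prefix."""
        node = self._root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return False
        return True


trie = Trie(["these", "are", "my", "words"])
print("words" in trie)         # complete word stored in the trie
print("word" in trie)          # only a prefix, not a stored word
print(trie.has_prefix("wor"))  # prefix query, which a set cannot answer directly
```

What a trie does offer over a set is prefix queries and the sharing of common prefixes between words.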

EDIT: The O(n) worst case only arises when there are huge numbers of hash collisions, as you suspected, but that is not usually a problem in practice. In any case, I would go with the simplest approach first and only look at more advanced solutions if you actually run into performance or memory issues.
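The worst case can be provoked on purpose. The sketch below uses a hypothetical class, CollidingWord, whose instances all return the same hash, and counts how many equality comparisons a single failed lookup needs. Ordinary strings do not behave like this; the constant hash is an artificial assumption made purely to show where the O(n) comes from. As far as I know, CPython's set resolves collisions with open addressing (probing) rather than chaining, but the effect of mass collisions is the same: the lookup degrades to comparing against every stored element.

```python
class CollidingWord:
    """Hypothetical wrapper whose instances all share one hash value."""

    eq_calls = 0  # class-wide counter of equality comparisons

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return 1  # deliberately terrible: every instance collides

    def __eq__(self, other):
        CollidingWord.eq_calls += 1
        return isinstance(other, CollidingWord) and self.text == other.text


# Build a bag of 200 distinct elements that all collide.
bag = {CollidingWord(str(i)) for i in range(200)}

# Reset the counter so only the lookup below is measured.
CollidingWord.eq_calls = 0
found = CollidingWord("missing") in bag
collision_comparisons = CollidingWord.eq_calls

print(found)
print(collision_comparisons)  # at least one comparison per stored element
```

With well-distributed hashes, such as those of normal strings, the same lookup typically touches only a slot or two, which is why the average case is O(1).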


 