defaultdict vs dict element initialization

I am trying to optimize the performance of a script that looks up similar words in a lexicon for each word given.

Each unique word is split into letter n-grams, and for each n-gram the lexicon returns a list of words that contain the same letter n-gram. Each word from this list is then added to a dictionary as a key and its value is incremented by one. This gives me a dictionary of similar words with corresponding frequency scores.

word_dict = {}
get = word_dict.get
for n_gram in word:  # each letter n-gram of the word
    for entry in lexicon[n_gram]:
        word_dict[entry] = get(entry, 0) + 1

This implementation works, but the script could supposedly run faster by replacing the dict with collections.defaultdict:

from collections import defaultdict

word_dd = defaultdict(int)
for n_gram in word:  # each letter n-gram of the word
    for entry in lexicon[n_gram]:
        word_dd[entry] += 1

No other code has been changed.

I was under the impression that both code snippets (most importantly the score adding) should work in exactly the same way, i.e. if the key exists, increase its value by 1; if it does not exist, create the key and set the value to 1.
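That expectation can be checked on a small case. The following is only a sketch, written for Python 3, and the `lexicon` contents and the `n_grams` list are made-up toy data rather than anything from the real script:

```python
from collections import defaultdict

# Hypothetical toy data: each letter n-gram maps to the words containing it.
lexicon = {
    "ca": ["cat", "car", "cab"],
    "at": ["cat", "bat"],
}
n_grams = ["ca", "at"]  # the letter bigrams of the word "cat"

# Variant 1: plain dict with dict.get supplying the default.
word_dict = {}
for n_gram in n_grams:
    for entry in lexicon[n_gram]:
        word_dict[entry] = word_dict.get(entry, 0) + 1

# Variant 2: defaultdict(int) supplying the default on first access.
word_dd = defaultdict(int)
for n_gram in n_grams:
    for entry in lexicon[n_gram]:
        word_dd[entry] += 1

print(word_dict == dict(word_dd))  # True
print(min(word_dd.values()))       # 1
```

On this input both variants give the same counts, and no key ends up with the value 0, so the counting loop alone does not explain a zero.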

After running the new code, however, some of the keys had values of 0, which I find logically impossible.

Is my logic or knowledge of defaultdict functionality flawed? If not, how can any value in word_dd be set to 0?

edit: I am also very sure that no other part of the script skews these results, as I test the dictionary immediately after the code shown, using:

for item in word_dd.iteritems():
    if item[1] == 0:
        print "Found zero value element"
        break

When you access a key in a defaultdict, if it is not there, it will be created automatically. Since we have int as the default factory function, it creates the key and gives it the default value 0.

from collections import defaultdict
d = defaultdict(int)
print d["a"]
# 0
print d
# defaultdict(<type 'int'>, {'a': 0})

So, before accessing a key, you should check whether it exists in the defaultdict instance, like this:

print "b" in d
# False ("b" has never been accessed, and a membership test does not create it)
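As a rough illustration of which reads are safe, here is a sketch in Python 3 syntax. The key names are arbitrary; the point is that a membership test and `dict.get` do not call the default factory, while item access does:

```python
from collections import defaultdict

d = defaultdict(int)
d["seen"] += 1

# Neither of these two reads stores anything in the dictionary.
in_result = "missing" in d          # False
get_result = d.get("missing", 0)    # 0, returned without creating the key
keys_after_safe_reads = sorted(d)

# Item access calls the default factory and stores the result.
value = d["missing"]                # 0, and "missing" is now a key
keys_after_item_access = sorted(d)

print(keys_after_safe_reads)   # ['seen']
print(keys_after_item_access)  # ['missing', 'seen']
```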

Any item access to a key will materialise the value:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> d['foo']
0

Use a containment test to check for existence instead:

>>> 'bar' in d
False
>>> 'foo' in d
True

Since you are counting n-grams, you probably want to look at collections.Counter() as well:

from collections import Counter

word_counter = Counter()
for n_gram in word:  # each letter n-gram of the word
    word_counter.update(lexicon[n_gram])

where Counter.update() updates the counts for all entries that the lexicon[n_gram] expression returns.

Like defaultdict(int), Counter() objects default to integer 0 for missing keys. Unlike defaultdict(int), a plain lookup on a Counter returns that 0 without storing the key; the key is only created once a count is assigned or updated.
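One detail worth checking here is what a lookup of a missing key does to each container. A short sketch in Python 3, using made-up words: a Counter returns 0 for the missing key without storing it, while a defaultdict(int) stores it.

```python
from collections import Counter, defaultdict

counts = Counter(["cat", "car", "cat"])
dd = defaultdict(int, counts)  # same counts, held in a defaultdict

counter_lookup = counts["dog"]  # 0, and the Counter is left unchanged
dd_lookup = dd["dog"]           # 0, but "dog" is now stored with value 0

print("dog" in counts)  # False
print("dog" in dd)      # True
```

So with a Counter, looking up words that were never counted would not leave zero-valued entries behind.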

Alas, I have found the fault in my code.

As there are many consecutive word n-grams with the same tested word in my input set, I only create the dictionary of similar words once per unique tested word.

This dictionary is then used for other purposes, with keys being tested multiple times. This, of course, can create zero-valued elements if the dictionary is a collections.defaultdict and the default factory is not set to None.
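A minimal sketch of the default-factory option, in Python 3 with made-up words: once counting is done, setting `default_factory` to None makes later lookups of unknown keys raise KeyError instead of storing a 0.

```python
from collections import defaultdict

word_dd = defaultdict(int)
for entry in ["cat", "car", "cat"]:
    word_dd[entry] += 1

# Counting is finished: stop the defaultdict from creating new keys.
word_dd.default_factory = None

try:
    word_dd["dog"]
    lookup_raised = False
except KeyError:
    lookup_raised = True

print(lookup_raised)  # True
print(dict(word_dd))  # {'cat': 2, 'car': 1}
```

Whether this is preferable to reading with `in` or `.get()` depends on whether the later code expects missing words to score 0 or to be treated as errors.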

Testing for zero-valued elements was, however, done in every iteration of the main loop, so the test found zero-valued elements that lookups had created during the previous iteration.
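The effect can be reproduced on a small scale. This is only a sketch in Python 3; the helper `zero_keys` and the words are hypothetical stand-ins for the real script:

```python
from collections import defaultdict

def zero_keys(d):
    """Return the keys whose stored value is 0."""
    return [key for key, value in d.items() if value == 0]

# Build the dictionary of similar words once.
word_dd = defaultdict(int)
for entry in ["cat", "car", "cat"]:
    word_dd[entry] += 1

zeros_right_after_building = zero_keys(word_dd)

# Later use of the same dictionary: looking up a word that was never counted.
score = word_dd["dog"]  # returns 0 and stores "dog" with value 0

zeros_on_next_check = zero_keys(word_dd)

print(zeros_right_after_building)  # []
print(zeros_on_next_check)         # ['dog']
```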

After indenting the testing code into the proper part, the results are as expected: no zero-valued elements immediately after creation.

I would like to apologize to everyone for the faulty and incomplete construction of my question; it was impossible for anyone else to find the error.
