简体   繁体   中英

defaultdict vs dict element initialization

I am trying to optimize the performance of a script that looks up similar words in a lexicon for each word given.

Each unique word is to be split into letter n-grams and for each n-gram, the lexicon returns a list of words that contain the same letter n-gram. Each word from this list is then added to a dictionary as a key and it's value is incremented by one. This gives me a dictionary of similar words with corresponding frequency scores.

word_dict = {}
get = word_dict.get
for letter_n_gram in word:
    for entry in lexicon[n_gram]:
        word_dict[entry] = get(entry, 0) + 1

This implementation works, but the script could supposedly run faster by switching the dict for collections.defaultdict .

word_dd = defaultdict(int)
for letter_n_gram in word:
    for entry in lexicon[n_gram]:
        word_dd[entry] += 1

No other code has been changed.

I was under the impression that both code snippets (most importantly the score adding) should work in the exact same way, ie if the key exists, increase its value by 1, if it does not exist, create the key and set the value to 1.

After running the new code, however, some of the keys had values of 0, which I find logically impossible.

Is my logic or knowledge of defaultdict functionality flawed? If not, how can any value in word_dd be set to 0?

edit: I am also very sure that no other part of the script skews these results, as I test the dictionary immediately after shown code by using:

for item in word_dd.iteritems():
    if item[1] == 0:
        print "Found zero value element"
        break

When you access a key in a defaultdict , if it is not there, it will be created automatically. Since we have int as the default factory function, it creates the key and gives the default value 0.

from collections import defaultdict
d = defaultdict(int)
print d["a"]
# 0
print d
# defaultdict(<type 'int'>, {'a': 0})

So, before accessing the key, you should make sure that it exists in the defaultdict instance, like this

print "a" in d
# False

Any item access to a key will materialise the value:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> d['foo']
0

Use containment to test for the existence instead:

>>> 'bar' in d
False
>>> 'foo' in d
True

Since you are counting n-grams, you probably want to look at collections.Counter() as well:

from collections import Counter

word_counter = Counter()
for letter_n_gram in word:
    word_counter.update(lexicon[n_gram])

where the Counter.update() will update counts for all entries the lexicon[n_gram] expression returns.

Like defaultdict(int) , Counter() objects materialise values automatically, defaulting to integer 0 .

Alas, I have found the fault in my code.

As there are many consequent word n-grams with the same tested word in my input set, I am only creating the dictionary of similar words once per unique tested word.

This dictionary is then used for other purposes with keys being tested multiple times. This, of course, can create zero-valued elements, if the dictionary is collections.defaultdict and the default factory is not set to None .

Testing for the zero-valued elements was, however, done in each main loop - therefore finding zero-valued elements created in the previous loop.

After indenting the testing code into the proper part, the results are as expected - no zero-valued elements immediately after creation.

I would like to apologize to everyone for the faulty and incomplete construction of my question - it was impossible for anyone else to find the error.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM