简体   繁体   中英

Python Set Comprehension Nested in Dict Comprehension

I have a list of tuples, where each tuple contains a string and a number in the form of:

[(string_1, num_a), (string_2, num_b), ...]

The strings are nonunique, and so are the numbers, eg (string_1 , num_m) or (string_9 , num_b) are likely to exist in the list.

I'm attempting to create a dictionary with the string as the key and a set of all numbers occurring with that string as the value:

dict = {string_1: {num_a, num_m}, string_2: {num_b}, ...}

I've done this somewhat successfully with the following dictionary comprehension with nested set comprehension:

#st_id_list = [(string_1, num_a), ...]
#st_dict = {string_1: {num_a, num_m}, ...} 
st_dict = {
    st[0]: set(
        st_[1]
        for st_ in st_id_list
        if st_[0] == st[0]
    )
    for st in st_id_list
}

There's only one issue: st_id_list is 18,000 items long. This snippet of code takes less than ten seconds to run for a list of 500 tuples, but over twelve minutes to run for the full 18,000 tuples. I have to think this is because I've nested a set comprehension inside a dict comprehension.

Is there a way to avoid this, or a smarter way to it?

You have a double loop, so you take O(N**2) time to produce your dictionary. For 500 items, 250.000 steps are taken, and for your 18k items, 324 million steps need to be done.

Here is a O(N) loop instead, so 500 steps for your smaller dataset, 18.000 steps for the larger dataset:

st_dict = {}
for st, id in st_id_list:
    st_dict.setdefault(st, set()).add(id)

This uses the dict.setdefault() method to ensure that for a given key (your string values), there is at least an empty set available if the key is missing, then adds the current id value to that set.

You can do the same with a collections.defaultdict() object :

from collections import defaultdict

st_dict = defaultdict(set)
for st, id in st_id_list:
    st_dict[st].add(id)

The defaultdict() uses the factory passed in to set a default value for missing keys.

The disadvantage of the defaultdict approach is that the object continues to produce default values for missing keys after your loop, which can hide application bugs. Use st_dict.default_factory = None to disable the factory explicitly to prevent that.

Why are you using two loops when you could do in one loop like this:

list_1=[('string_1', 'num_a'), ('string_2', 'num_b'),('string_1' , 'num_m'),('string_9' , 'num_b')]

string_num={}
for i in list_1:
    if i[0] not in string_num:
        string_num[i[0]]={i[1]}
    else:
        string_num[i[0]].add(i[1])

print(string_num)

output:

{'string_9': {'num_b'}, 'string_1': {'num_a', 'num_m'}, 'string_2': {'num_b'}}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM