简体   繁体   中英

Fastest way to generate a dict from list where key == value

I have a list, say:

NUM = 100
my_list = list(range(NUM))

I would like to generate a dict where the key is equal to the value, something like:

my_dict = {item: item for item in my_list}

or:

my_dict = dict(zip(my_list, my_list))

I have run some micro-benchmarks, and it looks like they have similar speed, but I was hoping that the second would be much faster, since the looping should be happening in C.

For example, the following construct:

my_dict = {key: SOMETHING for key in keys}

translates into the much faster:

my_dict = dict.fromkeys(k, SOMETHING)

So, my question is: is there any similar such construct for {x: x for x in my_list} ?


EDIT

I have checked dir(dict) and there seems to be nothing in this direction (I would expect it to be called something like dict.fromitems() ).


EDIT 2

A method like dict.fromitems() would have a broader application than this specific use-case, because:

dict.fromitems(keys, values)

could, in principle substitute both:

{k, v for k, v in zip(keys, values)}

and:

dict(zip(keys, values))

No, there is no faster method available for dictionaries.

That's because the performance cost is all in processing each item from the iterator, computing its hash and slotting the key into the dictionary data hash table structures (including growing those structures dynamically). Executing the dictionary comprehension bytecode is really insignificant in comparison.

dict(zip(it, it)) , {k: k for k in it} and dict.fromkeys(it) are all close in speed:

>>> from timeit import Timer
>>> tests = {
...     'dictcomp': '{k: k for k in it}',
...     'dictzip': 'dict(zip(it, it))',
...     'fromkeys': 'dict.fromkeys(it)',
... }
>>> timings = {n: [] for n in tests}
>>> for magnitude in range(2, 8):
...     it = range(10 ** magnitude)
...     for name, test in tests.items():
...         peritemtimes = []
...         for repetition in range(3):
...             count, total = Timer(test, 'from __main__ import it').autorange()
...             peritemtimes.append(total / count / (10 ** magnitude))
...         timings[name].append(min(peritemtimes))  # best of 3
...
>>> for name, times in timings.items():
...     print(f'{name:>8}', *(f'{t * 10 ** 9:5.1f} ns' for t in times), sep=' | ')
...
dictcomp |  46.5 ns |  47.5 ns |  50.0 ns |  79.0 ns | 101.1 ns | 111.7 ns
 dictzip |  49.3 ns |  56.3 ns |  71.6 ns | 109.7 ns | 132.9 ns | 145.8 ns
fromkeys |  33.9 ns |  37.2 ns |  37.4 ns |  62.7 ns |  87.6 ns |  95.7 ns

That's a table of the per-item cost for each technique, from 100 to 10 million items. The timings go up as the additional cost of growing the hash table structures accumulate.

Sure, dict.fromkeys() can process items a little bit faster, but it's not an order of magnitude faster than the other processes. It's (small) speed advantage does not come from being able to iterate in C here; the difference lies purely in not having to update the value pointer each iteration; all keys point to the single value reference.

zip() is slower because it builds additional objects (creating a 2-item tuple for each key-value pair is not a cost-free operation), and it increased the number of iterators involved in the process, you go from a single iterator for the dictionary comprehension and dict.fromkeys() , to 3 iterators (the dict() iteration delegated, via zip() , to two separate iterators for the keys and values).

There is no point in adding a separate method to the dict class to handle this in C, because

  1. is not a common enough use case anyway (creating a mapping with keys and values equal is not a common need)
  2. not going to be significantly faster in C than it would be with a dictionary comprehension anyway .

This is the Zen answer. The dictionary comprehension loop itself is fast, this is not the bottleneck. As Martijn Pieters says the time is spent:

  1. Accessing the key and;
  2. Computing the __hash__ of the key.
  3. Possibly your __hash__ is bad and you have a lot of collisions, causing the dictionary inserts to be expensive.

Don't worry about the loop. If it is taking a long time to build your dictionary it is because these operations are slow.

Using the results of the answer here , we create a new class that subclasses defaultdict, and override its missing attribute to allow for passing the key to the default_factory:

from collections import defaultdict
class keydefaultdict(defaultdict):
    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        else:
            ret = self[key] = self.default_factory(key)
            return ret

Now you can create the sort of dictionary you're looking for by doing the following:

my_dict = keydefaultdict(lambda x: x)

Then, whenever you need do a mapping for the keys that don't map to themselves, you simply update those values.

Timings.

Subclassing defaultdict :

%%timeit
my_dict = keydefaultdict(lambda x: x)
for num in some_numbers: my_dict[num] == num

Results:

4.46 s ± 71.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dict comprehension

%%timeit
my_dict = {x: x for x in some_numbers}
for num in some_numbers: my_dict[num] == num

Results:

1.19 s ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The two become comparable when you wind up needing to access approximately 17% of the original values. And better if you need fewer:

Accessing only a portion of the original values

Subclassing defaultdict :

%%timeit
frac = 0.17
my_dict = keydefaultdict(lambda x: x)
for num in some_numbers[:int(len(some_numbers)*frac)]: my_dict[num] == num

Results:

770 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dict comprehension

%%timeit
frac = 0.175
my_dict = {x: x for x in some_numbers}
for num in some_numbers[:int(len(some_numbers)*frac)]: my_dict[num] == num

Results:

781 ms ± 4.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM