Fast dictionary lookup using a list of keys

I have a dictionary:

d = {'a':1, 'b':2, 'c':3}

and my list of keys:

keys = np.array(['a','b','a','c','a','b'])

I would like to get the list of corresponding values without using for loops.

I tried the following loop-based version, but it is too computationally expensive for my use case:

l = [d[i] for i in keys]

Is there a version without for loops, perhaps one that exploits broadcasting or masks on an np.array?

For the general case, the list comprehension approach [d[i] for i in keys] is fine.

For very large lists, one way to improve performance is to define a structured array, which allows mixed types, and use np.searchsorted:

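import numpy as np

def str_array(d, keys):
    items = list(d.items())
    k, v = zip(*items)
    # Byte width of the integer values (e.g. 8 for int64)
    dtype_v = np.max(v).itemsize
    # Unicode dtype wide enough for the longest key
    dtype_k = np.array(k).dtype
    # Structured array with one (key, value) record per dictionary item
    a = np.array(items, dtype=[('key', dtype_k),
                               ('value', f'i{dtype_v}')])
    # Indices that sort the keys, as required by the sorter argument
    ixs_s = np.argsort(a['key'])
    # Locate each lookup key in sorted order, then map back to positions in a
    k_ixs = ixs_s[np.searchsorted(a['key'], keys, sorter=ixs_s)]
    return a['value'][k_ixs]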

str_array(d,keys)
# array([1, 2, 1, 3, 1, 2])
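
To make the argsort / searchsorted step easier to follow, here is a minimal sketch that uses two plain arrays instead of a structured array. The names dict_keys, dict_values, order and pos are illustrative only and are not part of the answer's function; the keys are deliberately stored out of order to show why the sorter argument is needed.

```python
import numpy as np

# Illustrative stand-alone arrays mirroring d = {'a': 1, 'b': 2, 'c': 3},
# deliberately stored out of order.
dict_keys = np.array(['c', 'a', 'b'])
dict_values = np.array([3, 1, 2])
keys = np.array(['a', 'b', 'a', 'c', 'a', 'b'])

# Indices that would sort dict_keys: the sorted view is ['a', 'b', 'c']
order = np.argsort(dict_keys)

# Position of each lookup key within the sorted view of dict_keys
pos_sorted = np.searchsorted(dict_keys, keys, sorter=order)

# Map those positions back to the original (unsorted) arrays
pos = order[pos_sorted]

# Sanity check: every lookup key was actually found
assert (dict_keys[pos] == keys).all()

result = dict_values[pos]
print(result)  # [1 2 1 3 1 2]
```

Note that np.searchsorted returns an insertion point rather than raising for a key that is absent, so a check like the assertion above is worth keeping if the keys are not guaranteed to exist in the dictionary.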

Let's compare its performance with some other typical approaches:

import operator
import numpy as np
import pandas as pd
import perfplot

d = {'key1':100, 'some_other_key':8, 'key3':15, 'nth_key':0}

perfplot.show(
    setup=lambda n: np.random.choice(list(d.keys()), size=n), 

    kernels=[
        lambda x: np.array([d[i] for i in x]),
        lambda x: np.vectorize(d.get)(x),
        lambda x: pd.Series(d).loc[x].values,
        lambda x: operator.itemgetter(*x)(d),
        lambda x: str_array(d, x),
    ],

    labels=['list-comp', 'np.vectorize', 'pd.loc', 'itemgetter', 'str_array'],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N'
)

[Figure: perfplot output comparing the runtime of the five approaches against N]

So, for instance, for n=100_000:

keys = np.random.choice(list(d.keys()), size=100_000)

%timeit str_array(d, keys)
# 5.51 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit [d[i] for i in keys]
# 51.7 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In this benchmark, the np.searchsorted approach is roughly 10 times faster than a simple list comprehension (5.51 ms vs. 51.7 ms).

I don't know about relative performance, but I find this solution very fast and simple: convert your keys to a Series, then use the pandas built-in map function to return your answer.

import numpy as np
import pandas as pd

d = {'a':1, 'b':2, 'c':3}
keys = np.array(['a','b','a','c','a','b'])
keys1 = pd.Series(keys)
keys1.map(d)  # Series holding the values 1, 2, 1, 3, 1, 2
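
One thing to watch with map: a key that is missing from the dictionary does not raise, it becomes NaN. The sketch below shows one way to detect that and to get a plain NumPy array back. The key 'z' is a hypothetical missing key added purely for illustration.

```python
import numpy as np
import pandas as pd

d = {'a': 1, 'b': 2, 'c': 3}
# 'z' is a hypothetical key that is absent from d
keys = np.array(['a', 'b', 'z', 'c'])

# Keys that are not in d are mapped to NaN instead of raising KeyError
mapped = pd.Series(keys).map(d)

# Collect the lookup keys that were not found
missing = keys[mapped.isna().to_numpy()]
print(missing)  # ['z']

# When every key is present, the result converts cleanly to a NumPy array
values = pd.Series(np.array(['a', 'b', 'a'])).map(d).to_numpy()
print(values)  # [1 2 1]
```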
