
Faster Way to Lookup Values in Numpy Structured Array

While learning to create a hashmap using NumPy and Python 3, I came up with the following code, which uses a NumPy structured array, data.

However, selecting a value by its key is quite slow: in the timeit runs below, 10,000 lookups take 13.3 secs with the NumPy structured array data, compared with 0.0008 secs with the Python dictionary d.

val = data[data['keys'] == key]['values'][0]

Is there a faster way to get the item for a particular key?

import numpy as np
import timeit

N = 1000*1000
keyArr = np.random.randint(0, 1000*1000*1000*4, N, dtype=np.int64)  # int64: the upper bound exceeds the int32 range
valArr = np.random.rand(N)
key = keyArr[0]                                     # Select an existing key value

# Numpy structured array
data = np.empty(keyArr.shape[0], dtype=[('keys', keyArr.dtype), ('values', valArr.dtype)])
data['keys'] = keyArr
data['values'] = valArr

val = data[data['keys'] == key]['values'][0]
print(key, '=>', val)                               # 558520981 => 0.17948995177905835
print( timeit.Timer("data[data['keys'] == key]['values'][0]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 13.256318201000001 secs

# Python built-in dictionary
d = {}
for k, v in zip(keyArr, valArr):
    d[k] = v

print(key, '=>', d[key])                            # 558520981 => 0.17948995177905835
print( timeit.Timer("d[key]",       
    globals=globals()).timeit(10*1000) , 'secs' )   # 0.0008061910000000116 secs

Numpy Lookup Method 1: 13.3 secs

val = data[data['keys'] == key]['values'][0]

Numpy Lookup Method 2: 13.4 secs

val = data['values'][np.where(data['keys'] == key)][0]

pandas.Series : 6.8 secs

import pandas as pd

# Pandas Series
s = pd.Series(valArr, index=keyArr, dtype=valArr.dtype)
val = s[key]
print(key, '=>', val)
print( timeit.Timer("s[key]", 
    globals=globals()).timeit(10*1000) , 'secs' )   # 6.782590246000002 secs

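The main source of the problem is that lookups like these in NumPy and pandas have to check every element of the array: data['keys'] == key compares the key against all N entries and builds a full boolean mask before anything is selected, so each lookup costs O(N). These libraries are designed for multiple selection and more complex lookup operations. A Python dictionary, by contrast, can only perform single-key lookups, but it is implemented as a hash table, so each lookup takes constant time on average regardless of N.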

So, if all you need is access by key, I don't think you'll find anything faster than a dictionary. If you also need the more complex selections, I'd put my bet on pandas for the fastest access times.
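If the data has to stay in NumPy, one possible way to avoid the full scan on every lookup is to sort the keys once and then binary-search them with np.searchsorted. This is not part of the answer above, only a minimal sketch under stated assumptions: the helper name lookup and the variables order, sorted_keys and sorted_vals are hypothetical, and the array is assumed not to change after sorting. Each lookup then costs O(log N) instead of O(N), after a one-time O(N log N) sort.

```python
import numpy as np

N = 1000 * 1000
rng = np.random.default_rng(0)
# int64 is requested explicitly because the upper bound exceeds the int32 range
keyArr = rng.integers(0, 1000 * 1000 * 1000 * 4, size=N, dtype=np.int64)
valArr = rng.random(N)

# Same structured array layout as in the question
data = np.empty(N, dtype=[('keys', keyArr.dtype), ('values', valArr.dtype)])
data['keys'] = keyArr
data['values'] = valArr

# One-time cost: a stable sort keeps duplicate keys in their original order,
# so the leftmost match is the same row the boolean-mask lookup would return.
order = np.argsort(data['keys'], kind='stable')
sorted_keys = np.ascontiguousarray(data['keys'][order])
sorted_vals = np.ascontiguousarray(data['values'][order])

def lookup(key):
    """Return the value for key via binary search (hypothetical helper)."""
    i = np.searchsorted(sorted_keys, key)  # leftmost insertion point
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return sorted_vals[i]
    raise KeyError(key)

key = keyArr[0]  # select an existing key
val = lookup(key)
print(key, '=>', val)
```

A dictionary is still likely to win for single-key lookups, since each call here goes through NumPy's function-call overhead. The sorted layout pays off mainly when many keys are looked up at once, because np.searchsorted also accepts an array of keys.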
