How can I implement a dictionary with a NumPy array?

Question

I need to write a huge amount number-number pairs into a NumPy array. Since a lot of these pairs have a second value of 0, I thought of making something akin to a dictionary. The problem is that I've read through the NumPy documentation on structured arrays and it seems like dictionaries built like those on the page can only use strings as keys.

Other than that, I need insertion and searching to have log(N) complexity. I thought of making my own Red-black tree structure using a regular NumPy array as storage, but I'm fairly certain there's an easier way to go about this.

Language is Python 2.7.12.

Answer 1

So you have an (N,2) array, and many values in x[:,1] are 0.

What do you mean by insertion ? Adding a value to the array to make it (N+1,2) ? Or just changing x[i,:] to something new?

And what about the search? numpy array are great for finding the ith values, x[i,:] , but not that good for finding the values that match z . python numpy filter two dimentional array by condition

scipy.sparse implements various forms of sparse matrix, which are useful if less than a tenth of the possible values are non-zero. One format is dok , a dictionary of keys. It is actually a dict subclass, and the keys are a 2d index tuple (i,j) . Other formats store their values as arrays,eg row, cols and data.

structured arrays are meant for cases with a modest number of named fields, and each field can hold a different type of data. But I don't think it helps to turn a (N,2) array into a (N,) array with 2 fields.

================

Your comments suggest that you aren't familiar with how numpy arrays are stored or accessed.

An array consists of a flat 1d data buffer (just a c array of bytes), and attributes like shape , strides , itemsize and dtype .

Let's say it is np.arange(100) .

In [1324]: np.arange(100).__array_interface__
Out[1324]: 
{'data': (163329128, False),
 'descr': [('', '<i4')],
 'shape': (100,),
 'strides': (4,)
 'typestr': '<i4',
 'version': 3}

So if I ask for x[50] , it calculates the strides, 4 bypes/element, * 50 elements = 200 bytes, and asks, in c code for the 4 bytes at 163329128+200 , and it returns them as an integer (object of np.int32 type actually).

For a structured array the type descr and bytes per element will be larger, but access will be the same. For a 2d array it will take the shape and strides tuples into account to find the appropriate index.

Strides for a (N,2) integer array is (8,4). So access to the x[10,1] element is with a 10*8 + 1*4 = 84 offset. And access to x[:,1] is with i*8 for i in range... offsets.

But in all cases it relies on the values being arranged in a rectangular predicable pattern. There's nothing fancy about the numpy data structures. They are relatively fast simply because many operations are coded in compiled code.

Sorting, accessing items by value, and rearranging elements are possible with arrays, but are not a strong point. More often than not these actions will produce a new array, with values copied from old to new in some new pattern.

There are just a few builtin numpy array subclasses, mainly np.matrix and np.masked_array , and they don't extend the access methods. Subclassing isn't as easy as with regular Python classes, since it numpy has some much of its own compiled code. A subclass has to have a __new__ method rather than regular __init__ .

There are Python modules that maintain sorted lists, bisect and heapq . But I don't see how they will help you with the large out-of-ram memory issue.

Answer 2

The most basic form of a dictionary is a structure called a HashMap . Implementing a hashmap relies on turning your key into a value that can be quickly looked up. A pathological example would be using int s as keys: The value for key 1 would go in array[1] , the value for key 2 would go in array[2] , the Hash Function is simply the identity function. You can easily implement that using a numpy array.

If you want to use other types, it's just a case of writing a good hash function to turn those keys into unique indexes into your array. For example, if you know you've got a (int, int) tuple, and the first value will never be more than 100, you can do 100*key[1] + key[0] .

The implementation of your hash function is what will make or break your dictionary replacement.

How can I implement a dictionary with a NumPy array?

Question

2 answers

solution1
0 2016-08-12 18:07:45

solution2
0 2016-08-12 18:14:53

How can I implement a dictionary with a NumPy array?

Question

2 answers

solution1 0 2016-08-12 18:07:45

solution2 0 2016-08-12 18:14:53

solution1
0 2016-08-12 18:07:45

solution2
0 2016-08-12 18:14:53