
How can I implement a dictionary with a NumPy array?

I need to write a huge number of number-number pairs into a NumPy array. Since a lot of these pairs have a second value of 0, I thought of making something akin to a dictionary. The problem is that I've read through the NumPy documentation on structured arrays, and it seems that dictionaries built like those on the page can only use strings as keys.

Other than that, I need insertion and searching to have O(log N) complexity. I thought of making my own red-black tree structure using a regular NumPy array as storage, but I'm fairly certain there's an easier way to go about this.

Language is Python 2.7.12.

So you have an (N,2) array, and many values in x[:,1] are 0.

What do you mean by insertion? Adding a value to the array to make it (N+1,2)? Or just changing x[i,:] to something new?

And what about the search? numpy arrays are great for finding the ith values, x[i,:], but not that good for finding the values that match z. See the related question "python numpy filter two dimentional array by condition".
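The difference between the two kinds of access can be sketched as follows. The array contents and the name z are made up for illustration; the point is only that lookup by position is a direct offset calculation, while lookup by value has to compare against every key.

```python
import numpy as np

# Hypothetical (N, 2) array of key/value pairs.
x = np.array([[3, 0], [7, 5], [1, 0], [9, 2]])

# Lookup by position: a direct offset into the buffer.
row = x[1, :]

# Lookup by value: builds a boolean mask by comparing every key to z, O(N).
z = 9
mask = x[:, 0] == z
matches = x[mask]
print(row, matches)
```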

scipy.sparse implements various forms of sparse matrix, which are useful if less than a tenth of the possible values are non-zero. One format is dok, a dictionary of keys. It is actually a dict subclass, and the keys are a 2d index tuple (i,j). Other formats store their values as arrays, e.g. row, cols and data.
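As a rough sketch of how the dok format might be used for number-number pairs, the example below treats the first number as a column index in a one-row matrix. That layout, the matrix size and the sample values are assumptions for illustration, not something the question specifies.

```python
import numpy as np
from scipy.sparse import dok_matrix

# One row, 1000 possible keys (hypothetical size); only non-zero values are stored.
m = dok_matrix((1, 1000), dtype=np.int64)
m[0, 3] = 7
m[0, 250] = 5

stored = m[0, 3]    # a key that was set
missing = m[0, 4]   # a key that was never set reads back as 0
count = m.nnz       # number of stored entries

# Converting to COO exposes the row, col and data arrays.
coo = m.tocoo()
print(stored, missing, count, coo.col, coo.data)
```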

Structured arrays are meant for cases with a modest number of named fields, where each field can hold a different type of data. But I don't think it helps to turn an (N,2) array into an (N,) array with 2 fields.

================

Your comments suggest that you aren't familiar with how numpy arrays are stored or accessed.

An array consists of a flat 1d data buffer (just a C array of bytes), and attributes like shape, strides, itemsize and dtype.

Let's say it is np.arange(100).

In [1324]: np.arange(100).__array_interface__
Out[1324]: 
{'data': (163329128, False),
 'descr': [('', '<i4')],
 'shape': (100,),
 'strides': (4,),
 'typestr': '<i4',
 'version': 3}

So if I ask for x[50], it uses the strides (4 bytes/element * 50 elements = 200 bytes), asks, in C code, for the 4 bytes at 163329128+200, and returns them as an integer (actually an object of np.int32 type).
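The offset arithmetic can be reproduced from Python, at least approximately. This sketch copies the buffer with tobytes() and reads 4 bytes at the computed offset; it forces dtype=np.int32 to match the '<i4' shown above, since the default integer size depends on the platform.

```python
import numpy as np

x = np.arange(100, dtype=np.int32)  # int32 assumed, to match '<i4'
raw = x.tobytes()                   # a copy of the flat data buffer

# 50 elements * 4 bytes per element = 200 bytes into the buffer.
offset = 50 * x.strides[0]

# Interpret the 4 bytes at that offset as a single int32.
value = np.frombuffer(raw, dtype=x.dtype, count=1, offset=offset)[0]
print(offset, value)
```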

For a structured array the type descr and bytes per element will be larger, but access will be the same. For a 2d array it will take the shape and strides tuples into account to find the appropriate index.

Strides for an (N,2) array of 4-byte integers are (8,4). So access to the x[10,1] element is with a 10*8 + 1*4 = 84 offset. And access to x[:,1] is with i*8 + 4 for i in range... offsets.
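The same check works for the 2d case. This is a sketch with an assumed N of 20 and int32 elements; read_at is a hypothetical helper that reads one element from a copy of the buffer at a given byte offset. Note that the elements of column 1 start 4 bytes into each 8-byte row.

```python
import numpy as np

x = np.arange(40, dtype=np.int32).reshape(20, 2)  # N = 20 is arbitrary
raw = x.tobytes()

def read_at(offset):
    """Hypothetical helper: read one int32 at a byte offset in the buffer copy."""
    return np.frombuffer(raw, dtype=x.dtype, count=1, offset=offset)[0]

# x[10, 1]: 10 rows * 8 bytes + 1 column * 4 bytes = 84.
offset = 10 * x.strides[0] + 1 * x.strides[1]
value = read_at(offset)

# x[:, 1]: one element per row, each at i*8 + 4.
column = [read_at(i * x.strides[0] + x.strides[1]) for i in range(x.shape[0])]
print(x.strides, offset, value)
```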

But in all cases it relies on the values being arranged in a predictable rectangular pattern. There's nothing fancy about the numpy data structures. They are relatively fast simply because many operations are implemented in compiled code.

Sorting, accessing items by value, and rearranging elements are possible with arrays, but are not a strong point. More often than not these actions will produce a new array, with values copied from old to new in some new pattern.
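A small illustration of that copying behaviour, using made-up data: sorting the pairs by key via argsort and index-array selection yields a separate array, and the original is left as it was.

```python
import numpy as np

x = np.array([[3, 0], [7, 5], [1, 0], [9, 2]])

# argsort gives the row order that sorts by the first column.
order = np.argsort(x[:, 0])

# Indexing with an index array copies the rows into a new array.
x_sorted = x[order]

shares = np.shares_memory(x, x_sorted)
print(x_sorted, shares)
```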

There are just a few builtin numpy array subclasses, mainly np.matrix and np.ma.masked_array, and they don't extend the access methods. Subclassing isn't as easy as with regular Python classes, since numpy has so much of its own compiled code. A subclass has to have a __new__ method rather than the regular __init__.
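For reference, the usual subclassing pattern looks roughly like this. LabeledArray and its label attribute are invented for the example; the __new__ plus __array_finalize__ structure follows the pattern described in the NumPy subclassing guide.

```python
import numpy as np

class LabeledArray(np.ndarray):
    """Hypothetical ndarray subclass carrying one extra attribute."""

    def __new__(cls, input_array, label=None):
        # View the input data as this subclass instead of using __init__.
        obj = np.asarray(input_array).view(cls)
        obj.label = label
        return obj

    def __array_finalize__(self, obj):
        # Called for views and slices too, so propagate the attribute here.
        if obj is None:
            return
        self.label = getattr(obj, "label", None)

pairs = LabeledArray([[1, 0], [2, 5]], label="pairs")
column = pairs[:, 1]
print(type(column).__name__, column.label)
```

Even with this in place, indexing still works exactly as it does for a plain array; the subclass only adds metadata.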

There are Python modules that maintain sorted lists, bisect and heapq. But I don't see how they will help you with the large out-of-RAM memory issue.
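For what it's worth, a sorted-keys approach might look like the sketch below. It assumes only the pairs with a non-zero second value are stored, in two parallel arrays sorted by key; lookup is a hypothetical helper. np.searchsorted gives an O(log N) search, but inserting into a sorted list or array still shifts elements, so insertion is O(N) rather than O(log N).

```python
import bisect
import numpy as np

# Only pairs with a non-zero second value are stored (an assumption).
keys = np.array([1, 9, 25])      # must stay sorted
values = np.array([10, 30, 50])

def lookup(key, default=0):
    """Hypothetical helper: binary search for key, default if absent."""
    i = np.searchsorted(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return default

found = lookup(9)
absent = lookup(4)

# bisect keeps a plain Python list sorted on insertion (O(N) shifting).
sorted_list = [1, 4, 9]
bisect.insort(sorted_list, 5)
print(found, absent, sorted_list)
```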

The most basic form of a dictionary is a structure called a hash map. Implementing a hash map relies on turning your key into a value that can be quickly looked up. A degenerate example would be using ints as keys: the value for key 1 would go in array[1], the value for key 2 would go in array[2], and the hash function is simply the identity function. You can easily implement that using a numpy array.
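A minimal sketch of the identity-hash idea, assuming the keys are non-negative integers with a known upper bound (MAX_KEY is a made-up limit) and that an unset key should read as 0:

```python
import numpy as np

MAX_KEY = 1000  # hypothetical upper bound on the keys

# One slot per possible key; unset keys read as 0.
table = np.zeros(MAX_KEY + 1, dtype=np.int64)

# "Insertion" and "lookup" are both plain indexing.
table[1] = 42
table[2] = 7

first = table[1]
unset = table[5]
print(first, unset)
```

The cost is memory: the array needs a slot for every possible key, whether or not it is used.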

If you want to use other types, it's just a case of writing a good hash function to turn those keys into unique indexes into your array. For example, if you know you've got an (int, int) tuple, and the first value will always be non-negative and less than 100, you can do 100*key[1] + key[0].
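The tuple case could be sketched like this. It assumes both values are non-negative, the first is strictly less than 100, and the second has some known bound; FIRST_LIMIT, SECOND_LIMIT and index_of are names invented for the example. Under those assumptions every tuple maps to a distinct index.

```python
import numpy as np

FIRST_LIMIT = 100   # first value must be in [0, 100)
SECOND_LIMIT = 50   # hypothetical bound on the second value

def index_of(key):
    """Map an (int, int) tuple to a unique array index."""
    first, second = key
    if not (0 <= first < FIRST_LIMIT and 0 <= second < SECOND_LIMIT):
        raise ValueError("key out of range: %r" % (key,))
    return FIRST_LIMIT * second + first

table = np.zeros(FIRST_LIMIT * SECOND_LIMIT, dtype=np.int64)
table[index_of((3, 7))] = 99

stored = table[index_of((3, 7))]
print(index_of((3, 7)), stored)
```

If the first value were allowed to equal 100, the keys (100, 0) and (0, 1) would both map to index 100, which is why the bound has to be strict.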

The implementation of your hash function is what will make or break your dictionary replacement.


 