How can I implement a dictionary with a NumPy array?
I need to write a huge number of number-number pairs into a NumPy array. Since a lot of these pairs have a second value of 0, I thought of making something akin to a dictionary. The problem is that I've read through the NumPy documentation on structured arrays, and it seems that dictionaries built like those on the page can only use strings as keys.
Other than that, I need insertion and searching to have O(log N) complexity. I thought of making my own red-black tree structure using a regular NumPy array as storage, but I'm fairly certain there's an easier way to go about this.
Language is Python 2.7.12.
So you have an (N,2) array, and many values in x[:,1] are 0.
What do you mean by insertion? Adding a value to the array to make it (N+1,2)? Or just changing x[i,:] to something new?
And what about the search? numpy arrays are great for finding the ith value, x[i,:], but not that good for finding the values that match z. Related question: "python numpy filter two dimentional array by condition".
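To illustrate the difference, here is a minimal sketch. The array x and the target value z are made up for the example. Indexing by position is a direct lookup, while finding rows by value means comparing every row.

```python
import numpy as np

# Hypothetical data: first column is the "key", second the "value".
x = np.array([[3, 0],
              [7, 5],
              [1, 0],
              [9, 2]])

# Lookup by position: a direct offset calculation.
row = x[2, :]

# Lookup by value: builds a boolean mask by testing all N keys.
z = 7
mask = x[:, 0] == z
matches = x[mask]

# Rows whose second value is non-zero.
nonzero_rows = x[x[:, 1] != 0]
```

The mask approach works, but it is O(N) per search, not O(log N).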
scipy.sparse implements various forms of sparse matrix, which are useful if less than a tenth of the possible values are non-zero. One format is dok, a dictionary of keys. It is actually a dict subclass, and the keys are a 2d index tuple (i,j). Other formats store their values as arrays, e.g. row, cols and data.
Structured arrays are meant for cases with a modest number of named fields, where each field can hold a different type of data. But I don't think it helps to turn a (N,2) array into a (N,) array with 2 fields.
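As a rough sketch of what that conversion would look like (the field names 'key' and 'value' are invented here): the field names must be strings, but the data stored in each field can be integers. Lookup by key value is still a scan over the whole array.

```python
import numpy as np

# Hypothetical field names; the dtype gives each record two int32 fields.
pair_dtype = np.dtype([('key', np.int32), ('value', np.int32)])

pairs = np.zeros(4, dtype=pair_dtype)   # shape (4,), two fields per element
pairs['key'] = [3, 7, 1, 9]
pairs['value'] = [0, 5, 0, 2]

# Fields are selected by name, elements by position.
third = pairs[2]

# Finding a record by key value is still a full comparison.
hit = pairs[pairs['key'] == 7]
```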
================
Your comments suggest that you aren't familiar with how numpy arrays are stored or accessed.
An array consists of a flat 1d data buffer (just a C array of bytes), and attributes like shape, strides, itemsize and dtype.
Let's say it is np.arange(100).
In [1324]: np.arange(100).__array_interface__
Out[1324]:
{'data': (163329128, False),
'descr': [('', '<i4')],
'shape': (100,),
'strides': (4,),
'typestr': '<i4',
'version': 3}
So if I ask for x[50], it calculates the offset from the strides, 4 bytes/element * 50 elements = 200 bytes, asks, in C code, for the 4 bytes at 163329128+200, and returns them as an integer (an object of np.int32 type, actually).
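The same arithmetic can be reproduced from Python. This is only a sketch: it forces a 4-byte integer dtype (an assumption, since the default integer size depends on the platform) and reads from a copy of the buffer rather than from the real memory address.

```python
import numpy as np

x = np.arange(100, dtype=np.int32)   # force 4-byte integers

stride = x.strides[0]                # 4 bytes between consecutive elements
offset = 50 * stride                 # 200 bytes into the buffer

# Read the 4 bytes at that offset from a copy of the raw buffer.
raw = x.tobytes()
value = np.frombuffer(raw, dtype=np.int32, count=1, offset=offset)[0]
```

Here value equals x[50], and type(x[50]) is np.int32.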
For a structured array the type descr and bytes per element will be larger, but access will be the same. For a 2d array it will take the shape and strides tuples into account to find the appropriate index.
Strides for a (N,2) integer array are (8,4). So access to the x[10,1] element is with a 10*8 + 1*4 = 84 offset. And access to x[:,1] is with i*8 + 4 for i in range... offsets.
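The 2d offset arithmetic can be checked the same way. In this sketch N = 20 and the int32 dtype are arbitrary choices for illustration.

```python
import numpy as np

N = 20
x = np.arange(2 * N, dtype=np.int32).reshape(N, 2)

strides = x.strides                       # (8, 4) for a C-ordered int32 array

# Byte offset of x[10, 1]: row stride * 10 + column stride * 1.
offset = 10 * strides[0] + 1 * strides[1]

# Byte offsets of every element in the column x[:, 1];
# each one includes the 4-byte step into the second column.
column_offsets = [i * strides[0] + 1 * strides[1] for i in range(N)]

# Check the single-element offset against a copy of the raw buffer.
value = np.frombuffer(x.tobytes(), dtype=np.int32, count=1, offset=offset)[0]
```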
But in all cases it relies on the values being arranged in a rectangular, predictable pattern. There's nothing fancy about the numpy data structures. They are relatively fast simply because many operations are coded in compiled code.
Sorting, accessing items by value, and rearranging elements are possible with arrays, but are not a strong point. More often than not these actions will produce a new array, with values copied from old to new in some new pattern.
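For instance, if the keys are kept sorted, np.searchsorted finds a position by binary search, but np.insert returns a freshly copied array rather than growing the old one in place. A small sketch with invented keys:

```python
import numpy as np

keys = np.array([1, 3, 7, 9])            # already sorted

# Binary search for the position where 5 would go.
pos = np.searchsorted(keys, 5)

# np.insert does not modify keys; it builds a new array with all values copied.
new_keys = np.insert(keys, pos, 5)

# Sorting with np.sort likewise returns a copy.
unsorted = np.array([9, 1, 7, 3])
sorted_copy = np.sort(unsorted)
```

So the search can be O(log N), but each insertion done this way costs O(N) because of the copy.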
There are just a few builtin numpy array subclasses, mainly np.matrix and np.ma.masked_array, and they don't extend the access methods. Subclassing isn't as easy as with regular Python classes, since numpy has so much of its own compiled code. A subclass has to have a __new__ method rather than the regular __init__.
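For reference, a minimal subclass following the pattern described in the numpy subclassing documentation might look like this. The class name TaggedArray and its tag attribute are hypothetical, chosen only to show where __new__ fits in.

```python
import numpy as np

class TaggedArray(np.ndarray):
    """Hypothetical ndarray subclass that carries an extra 'tag' attribute."""

    def __new__(cls, input_array, tag=None):
        # View the input data as an instance of this class.
        obj = np.asarray(input_array).view(cls)
        obj.tag = tag
        return obj

    def __array_finalize__(self, obj):
        # Called for views and slices too, where __new__ is not run.
        if obj is None:
            return
        self.tag = getattr(obj, 'tag', None)

a = TaggedArray([1, 2, 3], tag='pairs')
b = a[1:]            # slicing keeps the subclass and the tag
```

Note that this adds an attribute but does nothing to change how elements are looked up.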
There are Python modules that help maintain sorted lists and heaps, bisect and heapq. But I don't see how they will help you with the large out-of-RAM memory issue.
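A sketch of the bisect approach, using two parallel Python lists. The helper names insert and lookup are made up, and returning 0 for missing keys is an assumption based on the question (only non-zero pairs need to be stored). Everything still lives in memory.

```python
import bisect

keys = []      # kept sorted at all times
values = []    # values[i] belongs to keys[i]

def insert(key, value):
    """Insert or update a pair; the search is binary, but list.insert shifts items."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value
    else:
        keys.insert(i, key)
        values.insert(i, value)

def lookup(key, default=0):
    """Return the value stored for key, or default if the key was never stored."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return default

for k, v in [(7, 5), (3, 1), (9, 2)]:
    insert(k, v)
```

Lookup is O(log N). Insertion finds its position in O(log N) but still shifts list elements, so it is O(N) in the worst case.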
The most basic form of a dictionary is a structure called a HashMap. Implementing a hashmap relies on turning your key into a value that can be quickly looked up. A pathological example would be using ints as keys: the value for key 1 would go in array[1], the value for key 2 would go in array[2], and the hash function is simply the identity function. You can easily implement that using a numpy array.
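A minimal sketch of that identity-hash idea, assuming the keys are non-negative integers with a known upper bound (max_key is an invented limit, and put and get are hypothetical helper names):

```python
import numpy as np

max_key = 1000                                   # assumed upper bound on keys
table = np.zeros(max_key + 1, dtype=np.int64)    # keys never stored read as 0

def put(key, value):
    table[key] = value      # the key is its own index

def get(key):
    return table[key]

put(1, 42)
put(2, 17)
```

Both operations are O(1), at the cost of allocating one slot for every possible key.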
If you want to use other types, it's just a case of writing a good hash function to turn those keys into unique indexes into your array. For example, if you know you've got an (int, int) tuple, and the first value is non-negative and always less than 100, you can do 100*key[1] + key[0].
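The tuple case could be sketched as follows. The bounds FIRST_LIMIT and SECOND_LIMIT are assumptions made for the example: the index is unique only if 0 <= key[0] < FIRST_LIMIT, and the table size additionally needs a bound on key[1].

```python
import numpy as np

FIRST_LIMIT = 100     # assumption: 0 <= key[0] < 100
SECOND_LIMIT = 50     # assumption: 0 <= key[1] < 50

table = np.zeros(FIRST_LIMIT * SECOND_LIMIT, dtype=np.int64)

def index_of(key):
    """Map an (int, int) tuple to a unique flat index."""
    return FIRST_LIMIT * key[1] + key[0]

def put(key, value):
    table[index_of(key)] = value

def get(key):
    return table[index_of(key)]

put((3, 7), 11)
put((99, 0), 5)
put((0, 1), 6)
```

With these bounds, (99, 0) maps to index 99 and (0, 1) to index 100, so neighbouring keys do not collide.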
The implementation of your hash function is what will make or break your dictionary replacement.
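Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this post, please cite this site's URL or the original address. For any questions, contact yoyou2525@163.com.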