Reinterpreting NumPy arrays as a different dtype

Say I have a large NumPy array of dtype int32:

import numpy as np
N = 1000  # (large) number of elements
a = np.random.randint(0, 100, N, dtype=np.int32)

but now I want the data to be uint32. I could do

b = a.astype(np.uint32)

or even

b = a.astype(np.uint32, copy=False)

but in both cases b is a copy of a, whereas I simply want to reinterpret the data in a as being uint32, so as not to duplicate the memory. Similarly, using np.asarray() does not help.
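The copying behaviour described above can be checked directly. The following is a minimal sketch, assuming np.shares_memory is an acceptable way to test whether two arrays overlap in memory; the array size is arbitrary.

```python
import numpy as np

a = np.random.randint(0, 100, 1000, dtype=np.int32)

# copy=False only avoids the copy when no conversion is required.
# int32 -> uint32 is a conversion, so a new buffer is allocated anyway.
b = a.astype(np.uint32, copy=False)
print(np.shares_memory(a, b))  # False: b has its own memory

# With an unchanged dtype, copy=False hands back the input array itself.
c = a.astype(np.int32, copy=False)
print(c is a)  # True
```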

What does work is

a.dtype = np.uint32

which simply changes the dtype without altering the data at all. Here's a striking example:

import numpy as np
a = np.array([-1, 0, 1, 2], dtype=np.int32)
print(a)
a.dtype = np.uint32
print(a)  # shows "overflow", which is what I want

My questions are about the solution of simply overwriting the dtype of the array:

  1. Is this legitimate? Can you point me to where this feature is documented?
  2. Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
  3. What if I want two arrays a and b sharing the same data, but viewed as different dtypes? I've found the following to work, but again I'm concerned whether this is really OK to do:
     import numpy as np
     a = np.array([0, 1, 2, 3], dtype=np.int32)
     b = a.view(np.uint32)
     print(a)  # [0 1 2 3]
     print(b)  # [0 1 2 3]
     a[0] = -1
     print(a)  # [-1 1 2 3]
     print(b)  # [4294967295 1 2 3]
    Though this seems to work, I find it weird that the underlying data of the two arrays does not seem to be located at the same place in memory:
     print(a.data)
     print(b.data)
    Actually, it seems that the above gives different results each time it is run, so I don't understand what's going on there at all.
  4. This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats:
     import numpy as np
     a = np.array([0, 1, 2, np.pi], dtype=np.float32)
     b = a.view(np.float64)
     print(a)  # [0. 1. 2. 3.1415927]
     print(b)  # [0.0078125 50.12387848]
     b[0] = 8
     print(a)  # [0. 2.5 2. 3.1415927]
     print(b)  # [8. 50.12387848]
    Again, is this condoned, if the obtained behaviour is really what I'm after?
  1. Is this legitimate? Can you point me to where this feature is documented?

This is legitimate. However, using the view method (numpy.ndarray.view), which reinterprets the same data, is better since it is compatible with static analysers (so it is somewhat safer). Indeed, the documentation states:

It's possible to mutate the dtype of an array at runtime. [...] This sort of mutation is not allowed by the types. Users who want to write statically typed code should instead use the numpy.ndarray.view method to create a view of the array with a different dtype.
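As a small sketch of what the quoted recommendation looks like in practice (the array contents are only illustrative): view returns a new array object with the requested dtype, while the original array keeps its own dtype.

```python
import numpy as np

a = np.array([-1, 0, 1, 2], dtype=np.int32)

# view() creates a second array object over the same buffer;
# a itself is not mutated, which is what static type checkers expect.
b = a.view(np.uint32)

print(a.dtype, b.dtype)  # int32 uint32
print(b.base is a)       # True: b borrows its memory from a
```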

  2. Does it in fact leave the data of the array untouched, i.e. no duplication of the data?

Yes. The array is still a view on the same internal memory buffer (a basic byte array); NumPy just interprets it differently (this is done directly in the C code of each NumPy computing function).
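One way to convince yourself of this is to compare the buffer address and the raw bytes before and after the dtype is overwritten. This is only a sketch; using a.ctypes.data and a.tobytes() for the comparison is my choice, not something the documentation prescribes.

```python
import numpy as np

a = np.array([-1, 0, 1, 2], dtype=np.int32)

addr_before = a.ctypes.data   # address of the first byte of the buffer
bytes_before = a.tobytes()    # snapshot of the raw bytes

a.dtype = np.uint32           # reinterpret in place

print(a.ctypes.data == addr_before)  # True: same buffer
print(a.tobytes() == bytes_before)   # True: bytes untouched
```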

  3. What if I want two arrays a and b sharing the same data, but viewed as different dtypes? [...]

The view method can be used in this case, as you did in your example. However, the result is platform dependent. Indeed, NumPy just reinterprets bytes of memory, and theoretically the representation of negative numbers can change from one machine to another. Fortunately, all mainstream modern processors nowadays use two's complement (source). This means that an np.int32 value like -1 will be reinterpreted as 2**32-1 = 4294967295 with a view of type np.uint32. Non-negative signed values are unchanged. As long as you are aware of this, this is fine and the behaviour is predictable.
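A short sketch of the behaviour just described, assuming a two's complement machine. It also touches on the print(a.data) observation from the question: as far as I know, a.data builds a fresh memoryview object on each access, so its printed address is that of the wrapper object, not of the buffer. np.shares_memory or a.ctypes.data is a more telling check.

```python
import numpy as np

a = np.array([-1, 0, 1, 2], dtype=np.int32)
b = a.view(np.uint32)

# Both arrays start at the same address and overlap completely.
print(np.shares_memory(a, b))        # True
print(a.ctypes.data == b.ctypes.data)  # True

# With two's complement, the unsigned view is the signed value modulo 2**32.
expected = a.astype(np.int64) % 2**32
print(np.array_equal(b, expected))   # True

# Writes through one array are visible through the other.
a[1] = -2
print(b[1])                          # 4294967294
```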

  4. This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats.

Well, to put it shortly, this is really like playing with fire. In this case this should be considered unsafe, although it may work on your specific machine. Let us venture into troubled waters.

First of all, the documentation of numpy.ndarray.view states:

The behavior of the view cannot be predicted just from the superficial appearance of a. It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
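To make the quoted caveat concrete, here is a small sketch of how a size-changing view depends on the memory layout. The arrays are illustrative, and the exact error message varies between NumPy versions, so only the exception type is relied on.

```python
import numpy as np

a = np.arange(8, dtype=np.float32)

# Contiguous 1-D array: pairs of float32 are regrouped into float64 items.
print(a.view(np.float64).shape)                # (4,)

# C-ordered 2-D array: only the last axis shrinks.
print(a.reshape(2, 4).view(np.float64).shape)  # (2, 2)

# A strided slice is not contiguous, so the size-changing view is refused.
try:
    a[::2].view(np.float64)
    outcome = "ok"
except ValueError:
    outcome = "ValueError"
print(outcome)                                 # ValueError
```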

The thing is that NumPy reinterprets the pointer using C code. Thus, AFAIK, the strict aliasing rule applies. This means that reinterpreting an np.float32 value as an np.float64 causes undefined behaviour. One reason is that the alignment requirements are not the same for np.float32 (typically 4) and np.float64 (typically 8), so reading an unaligned np.float64 value from memory can cause a crash on some architectures (e.g. POWER), although x86-64 processors support this. Another reason comes from the compiler, which can over-optimize the code due to the strict aliasing rule by making wrong assumptions in your case (like assuming that an np.float32 value and an np.float64 value cannot overlap in memory, so the modification of the view should not change the original array). However, since NumPy is called from CPython and no function calls are inlined from the interpreter (probably not with Cython), this last point should not be a problem (it may be a problem if you use Numba or any JIT, though).

Note that it is safe to get an np.uint8 view of an np.float32 array since it does not break the strict aliasing rule (and the alignment is OK). This could be useful to efficiently serialize NumPy arrays. The opposite operation is not safe (especially due to the alignment).
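As a sketch of the safe direction mentioned above, an np.float32 array can be viewed as raw bytes and restored from them. The last lines build a deliberately shifted buffer (a hypothetical construction, only for illustration) so that flags.aligned can be inspected; no particular output is claimed for it since it depends on the platform and allocator.

```python
import numpy as np

a = np.array([0, 1, 2, np.pi], dtype=np.float32)

# Byte view: always allowed, uint8 has no alignment requirement.
raw = a.view(np.uint8)
print(raw.shape)                      # (16,): 4 items of 4 bytes each

# Round trip through a bytes object, e.g. for serialization.
payload = raw.tobytes()
restored = np.frombuffer(payload, dtype=np.float32)
print(np.array_equal(a, restored))    # True

# Opposite direction on a buffer offset by one byte: NumPy reports
# whether the resulting float64 view is properly aligned.
shifted = np.zeros(9, dtype=np.uint8)[1:].view(np.float64)
print(shifted.flags.aligned)
```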

Note: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original address.