How does np.ndarray.tobytes() work for dtype "object"?

I encountered a strange behavior of np.ndarray.tobytes() that makes me doubt whether it works deterministically, at least for arrays of dtype=object.

import numpy as np
print(np.array([1,[2]]).dtype)
# => object
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00H{!-\x01\x00\x00\x00'
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00\x88\x9d)-\x01\x00\x00\x00'

In the sample code, a list of mixed Python objects ([1, [2]]) is first converted to a numpy array, and then transformed to a byte sequence using tobytes().

Why do the resulting byte representations differ for repeated instantiations of the same data? The documentation just states that it converts an ndarray to raw Python bytes, but it does not mention any limitations. So far, I have observed this only for dtype=object. Numeric arrays always yield the same byte sequence:

np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh@l\xee?Qg\x1e\x8f~l\xe7?'
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh@l\xee?Qg\x1e\x8f~l\xe7?'

Have I missed something elementary about Python's/numpy's memory architecture? I tested with numpy version 1.17.2 on a Mac.


Context: I encountered this problem when trying to compute a hash for arbitrary data structures. I hoped that I could rely on the basic serialization capabilities of tobytes(), but this appears to be a wrong premise. I know that pickle is the standard for serialization in Python, but since I don't require portability and my data structures only contain numbers, I first sought help with numpy.

An array of dtype object stores pointers to the objects it contains. In CPython, each pointer corresponds to the object's id. Every time you create a new list, it will be allocated at a new memory address. However, small integers are interned, so 1 will reference the same integer object every time.

You can see exactly how this works by checking the IDs of some sample objects:

>>> x = np.array([1, [2]])
>>> x.tobytes()
b'\x90\x91\x04a\xfb\x7f\x00\x00\xc8[J\xaa+\x02\x00\x00'
>>> id(x[0])
140717641208208
>>> id(1)                             # Small integers are interned
140717641208208
>>> id(x[0]).to_bytes(8, 'little')    # Checks out as the first 8 bytes
b'\x90\x91\x04a\xfb\x7f\x00\x00'
>>> id(x[1]).to_bytes(8, 'little')    # Checks out as the last 8 bytes
b'\xc8[J\xaa+\x02\x00\x00'

As you can see, it is quite deterministic, but it serializes information that is essentially useless to you. The operation is the same for numeric arrays as for object arrays: it returns a copy of the underlying buffer. The contents of the buffer are what is throwing you off: for a numeric array the buffer holds the values themselves, while for an object array it holds only the addresses of the elements.
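To make the buffer contents concrete, here is a minimal sketch that rebuilds the output of tobytes() by hand for both kinds of array. It assumes CPython (where id() is the object's memory address) and passes dtype=object explicitly, since newer numpy versions refuse to build a ragged array without it. The variable names are only illustrative.

```python
import sys
import numpy as np

# Object array: the buffer holds one pointer per element.
x = np.array([1, [2]], dtype=object)

# CPython-specific assumption: id(obj) is the address stored in the buffer.
rebuilt_from_ids = b"".join(
    id(item).to_bytes(x.itemsize, sys.byteorder) for item in x
)
print(rebuilt_from_ids == x.tobytes())

# Numeric array: the buffer holds the values themselves.
n = np.array([1, 2], dtype="<i8")  # explicit little-endian 64-bit integers
rebuilt_from_values = b"".join(int(v).to_bytes(8, "little") for v in n)
print(rebuilt_from_values == n.tobytes())
```

On CPython both comparisons are expected to hold, which is why the numeric bytes are reproducible across runs while the object-array bytes change whenever a new list is allocated.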

Since you mentioned that you are computing hashes, keep in mind that there is a reason that Python lists are unhashable. You can have lists that are equal at one time and different at another. Using IDs is generally not a good basis for an effective hash.
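If what you need is a hash of the contents and not of the addresses, one option is to walk the structure and feed the values themselves into a hash function. The following is only a sketch under the assumption that the data consists of ints, floats, lists, tuples and numpy arrays; content_hash and _update are hypothetical helper names, not part of numpy, and the tagging scheme is an arbitrary choice.

```python
import hashlib
import struct
import numpy as np

def _update(h, obj):
    """Feed the *values* of obj into the hash object h, recursively."""
    if isinstance(obj, np.ndarray) and obj.dtype.kind != "O":
        # Numeric array: dtype, shape and raw buffer describe the contents.
        h.update(b"A" + obj.dtype.str.encode() + repr(obj.shape).encode())
        h.update(obj.tobytes())
    elif isinstance(obj, (list, tuple, np.ndarray)):
        # Sequences and object arrays: hash the length, then each element.
        h.update(b"L" + struct.pack("<Q", len(obj)))
        for item in obj:
            _update(h, item)
    elif isinstance(obj, (int, np.integer)):
        h.update(b"I" + str(int(obj)).encode() + b";")
    elif isinstance(obj, (float, np.floating)):
        h.update(b"F" + struct.pack("<d", float(obj)))
    else:
        raise TypeError(f"unsupported type: {type(obj).__name__}")

def content_hash(obj):
    """Hypothetical helper: hex digest that depends only on the contents."""
    h = hashlib.sha256()
    _update(h, obj)
    return h.hexdigest()

a = np.array([1, [2]], dtype=object)
b = np.array([1, [2]], dtype=object)
print(content_hash(a) == content_hash(b))
print(content_hash([1, [2]]) == content_hash([1, [3]]))
```

Note that in this sketch lists, tuples and object arrays with the same elements hash identically, and that the digest reflects the contents at the moment of hashing; if a list is mutated afterwards, the stored hash no longer matches it.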
