
Inconsistent dtype inference for mixed floats and strings

np.array([5.3, 1.2, 76.1, 'Alice', 'Bob', 'Claire'])

I would like to know why this gives a dtype of U32, while the following code gives a dtype of U6.

np.array(['Alice', 'Bob', 'Claire', 5.3, 1.2, 76.1])

NumPy tries to store data efficiently by working out how much space each object needs and choosing a data type to match.

import numpy as np
a = np.array([5.3, 1.2, 76.1, 'Alice', 'Bob', 'Claire'])
b = np.array(['Alice', 'Bob', 'Claire', 5.3, 1.2, 76.1])
print(a.dtype, b.dtype)

>>> <U32 <U6

NumPy sees 5.3 first and, following its data type conversion rules, puts it into a string data type that holds 32 code points. A data type object describes the following aspects of the data:

Type of the data (integer, float, Python object, etc.)

Size of the data (how many bytes are in e.g. the integer)

Byte order of the data (little-endian or big-endian)

If the data type is a structured data type, an aggregate of other data types (e.g. describing an array item consisting of an integer and a float),

what the names of the "fields" of the structure are, by which they can be accessed,

what the data type of each field is, and

which part of the memory block each field takes.

If the data type is a sub-array, what its shape and data type are.
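The properties listed above can be read directly off a dtype object. The following is a minimal sketch for illustration only; the field names `count` and `value` and the sub-array shape `(2, 3)` are made up for the example and do not come from the question.

```python
import numpy as np

# A fixed-width Unicode string dtype holding 32 code points.
text = np.dtype('U32')
print(text.kind)       # type of the data: 'U' means Unicode string
print(text.itemsize)   # size in bytes: 32 code points * 4 bytes each
print(text.byteorder)  # byte-order flag of the dtype

# A structured dtype (hypothetical fields): one integer and one float.
record = np.dtype([('count', np.int32), ('value', np.float64)])
print(record.names)            # names of the fields
print(record.fields['value'])  # (dtype of the field, byte offset in the item)

# A sub-array dtype: each item is a 2x3 block of 16-bit integers.
sub = np.dtype((np.int16, (2, 3)))
print(sub.shape, sub.base)
```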

When it then sees the other strings in the array, they all fit within the 32-code-point data type, so the data type does not have to change.
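One way to check this claim is to compare the capacity of the resulting dtype with the longest string actually stored. This is only a sketch; it assumes NumPy's fixed-width Unicode dtype uses 4 bytes per code point, and the names `capacity` and `longest` are chosen here for illustration.

```python
import numpy as np

a = np.array([5.3, 1.2, 76.1, 'Alice', 'Bob', 'Claire'])

# Each code point in a 'U' dtype occupies 4 bytes,
# so itemsize // 4 is the number of code points per element.
capacity = a.dtype.itemsize // 4

# Length of the longest string held in the array.
longest = max(len(s) for s in a)

print(a.dtype.kind, capacity, longest)
```

Since `longest` is well below `capacity`, none of the strings forces a wider dtype.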

Now, consider the second example. NumPy sees the strings first and puts them into a data type that can hold six code points (the length of 'Claire'). NumPy continues along and sees 5.3, whose string form also fits into a 6-code-point data type. So no upgrading is required.
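The reasoning for the second example comes down to the length of each element once it is written as text. The sketch below computes those lengths and builds the array with an explicit dtype. Passing `dtype` explicitly is a choice made here, not part of the question: the width NumPy infers on its own for this input may differ between NumPy versions (newer releases may report a wider dtype), so the sketch does not rely on inference.

```python
import numpy as np

values = ['Alice', 'Bob', 'Claire', 5.3, 1.2, 76.1]

# Length of each element when converted to text.
lengths = [len(str(v)) for v in values]
needed = max(lengths)

# Request exactly that width instead of relying on inference.
b = np.array(values, dtype=f'U{needed}')

print(lengths, needed)
print(b.tolist())
```

Requesting the width explicitly also makes the result independent of the order of the elements.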

Similarly, running np.array(['Alice', 'Bob', 'Claire', 5.3, 1.2, 76.1, 'Bobby', 2.3000000000001]) results in U15: NumPy sees 2.3000000000001, finds that the data type it is currently using is not large enough to hold it, and upgrades to a wider one.
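The figure of 15 can be reproduced by hand, since it is simply the length of the longest element written as text. The helper `required_width` below is hypothetical, written for this example rather than taken from NumPy.

```python
import numpy as np

values = ['Alice', 'Bob', 'Claire', 5.3, 1.2, 76.1, 'Bobby', 2.3000000000001]


def required_width(items):
    """Return the number of code points needed for the longest item as text."""
    return max(len(str(item)) for item in items)


width = required_width(values)

# The element responsible for the width.
widest = max(values, key=lambda v: len(str(v)))

# An array with exactly that width, requested explicitly.
c = np.array(values, dtype=f'U{width}')

print(width, widest, c.dtype.itemsize // 4)
```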

https://numpy.org/devdocs/reference/arrays.dtypes.html#arrays-dtypes
