
Python numpy float16 datatype operations, and float8?

When performing math operations on float16 NumPy numbers, the result is also a float16 number. My question is: how exactly is the result computed? Say I'm multiplying/adding two float16 numbers; does Python generate the result in float32 and then truncate/round it to float16? Or is the calculation performed in 16-bit multiplier/adder hardware all the way?

Another question: is there a float8 type? I couldn't find one... if not, then why? Thank you all!

To the first question: there's no hardware support for float16 on a typical processor (at least outside the GPU). NumPy does exactly what you suggest: convert the float16 operands to float32, perform the scalar operation on the float32 values, then round the float32 result back to float16. It can be proved that the results are still correctly rounded: the precision of float32 is large enough (relative to that of float16) that double rounding isn't an issue here, at least for the four basic arithmetic operations and square root.
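As a quick sanity check, this widen-compute-round sequence is easy to mimic from Python; the snippet below is just an illustration of the equivalence, not NumPy's actual code path:

import numpy as np

a = np.float16(0.1)
b = np.float16(0.2)

# native float16 multiplication, as NumPy's scalar machinery performs it
direct = a * b

# emulate it by hand: widen to float32, operate, round back down
emulated = np.float16(np.float32(a) * np.float32(b))

print(direct == emulated)  # True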

In the current NumPy source, this is what the definition of the four basic arithmetic operations looks like for float16 scalar operations:

#define half_ctype_add(a, b, outp) *(outp) = \
        npy_float_to_half(npy_half_to_float(a) + npy_half_to_float(b))
#define half_ctype_subtract(a, b, outp) *(outp) = \
        npy_float_to_half(npy_half_to_float(a) - npy_half_to_float(b))
#define half_ctype_multiply(a, b, outp) *(outp) = \
        npy_float_to_half(npy_half_to_float(a) * npy_half_to_float(b))
#define half_ctype_divide(a, b, outp) *(outp) = \
        npy_float_to_half(npy_half_to_float(a) / npy_half_to_float(b))

The code above is taken from scalarmath.c.src in the NumPy source. You can also take a look at loops.c.src for the corresponding code for array ufuncs. The supporting npy_half_to_float and npy_float_to_half functions are defined in halffloat.c, along with various other support functions for the float16 type.

For the second question: no, there's no float8 type in NumPy. float16 is a standardized type (described in the IEEE 754 standard) that's already in wide use in some contexts (notably GPUs). There's no IEEE 754 float8 type, and there doesn't appear to be an obvious candidate for a "standard" float8 type. I'd also guess that there just hasn't been that much demand for float8 support in NumPy.
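A quick check in an interpreter confirms the absence (true as of current NumPy releases):

import numpy as np

print(hasattr(np, 'float8'))  # False: no such scalar type exists
print(np.dtype('float16'))    # float16: the half-precision type is there
# np.dtype('float8') would raise a TypeError instead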

This answer builds on the float8 aspect of the question; the accepted answer covers the rest pretty well. One of the main reasons there isn't a widely accepted float8 type, other than the lack of a standard, is that it's not very useful in practice.

Primer on Floating Point

In standard notation, a float[n] data type is stored using n bits in memory. That means at most 2^n unique values can be represented. In IEEE 754, a handful of these possible values, like nan, aren't even numbers as such. That means all floating point representations (even if you go to float256) have gaps in the set of rational numbers they are able to represent, and they round to the nearest representable value if you try to store a number that falls in such a gap. Generally, the higher n is, the smaller these gaps are.
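For instance, NumPy's np.spacing reports the gap between a value and the next representable one, and you can watch it grow with the magnitude of the value:

import numpy as np

# gap to the next representable float16 value
print(np.spacing(np.float16(1.0)))     # 2**-10, about 0.00098
print(np.spacing(np.float16(2048.0)))  # 2.0: odd integers can't be stored here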

You can see the gap in action if you use the struct package to get the binary representation of some float32 numbers. It's a bit startling to run into at first, but there's a gap of 32 just in the integer space:

import struct

# 1,000,000,000 is exactly representable as a float32...
billion_as_float32 = struct.pack('f', 1000000000)
# ...and the next 32 integers all round to that same bit pattern
for i in range(32):
    assert struct.pack('f', 1000000001 + i) == billion_as_float32

Generally, floating point is best at tracking only the most significant bits, so that if your numbers have the same scale, the important differences are preserved. Floating point standards generally differ only in how they distribute the available bits between the significand and the exponent. For instance, IEEE 754 float32 uses 1 sign bit, 8 exponent bits, and 23 stored significand bits (24 bits of effective precision, thanks to the implicit leading bit).
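To make the split concrete, here's a small sketch (the float16_fields helper is just illustrative) that unpacks the 1 sign / 5 exponent / 10 fraction bit layout of IEEE 754 binary16:

import struct

def float16_fields(x):
    # pack x as IEEE 754 binary16, then slice the bits apart
    bits = int.from_bytes(struct.pack('<e', x), 'little')
    sign = bits >> 15                # 1 bit
    exponent = (bits >> 10) & 0x1f   # 5 bits, stored with a bias of 15
    fraction = bits & 0x3ff          # 10 bits (11 with the implicit lead)
    return sign, exponent, fraction

print(float16_fields(1.0))   # (0, 15, 0)
print(float16_fields(-2.5))  # (1, 16, 256)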

Back to float8

By the above logic, a float8 value can only ever take on 256 distinct values, no matter how cleverly you split the bits between significand and exponent. Unless you're keen on rounding numbers to one of 256 arbitrary values clustered near zero, it's probably more efficient to just track the 256 possibilities in an int8.

For instance, if you wanted to track a very small range with coarse precision, you could divide the range into 256 points and then store which of the 256 points your number was closest to, as sketched below. If you wanted to get really fancy, you could use a non-linear distribution of values, clustered either at the centre or at the edges, depending on what mattered most to you.
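A minimal sketch of that hand-rolled scheme (the encode/decode helpers and the [-1, 1] range are illustrative choices, not any standard format):

import numpy as np

LO, HI = -1.0, 1.0                 # the only range this scheme can represent
POINTS = np.linspace(LO, HI, 256)  # the 256 values a code can decode to

def encode(x):
    # index of the nearest of the 256 evenly spaced points (assumes LO <= x <= HI)
    return np.uint8(round((x - LO) / (HI - LO) * 255))

def decode(code):
    return POINTS[code]

code = encode(0.1234)
print(code, decode(code))  # 143 and ~0.1216: coarse, but it fits in one byte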

The likelihood of anyone else (or even yourself, later on) needing this exact scheme is extremely small, and most of the time the extra byte or three you pay as a penalty for using float16 or float32 instead is too small to make a meaningful difference. Hence... almost no one bothers to write up a float8 implementation.
