
What should I worry about if I compress a float64 array to float32 in numpy?

Original question, 2012-06-13 01:42:19. Tags: python, numpy, floating-point, compression

This is a particular kind of lossy compression that's quite easy to implement in numpy.

I could in principle directly compare the original (float64) to the reconstruction (float64(float32(original))) and learn things like the maximum error.
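The round-trip comparison described here can be sketched in a few lines. This is only an illustration: the array `original` holds made-up values, and the variable names are not from the question.

```python
import numpy as np

# Made-up sample values; 7.25 is exactly representable in float32.
original = np.array([0.1, 7.25, -19.999, 1e-3, 3.141592653589793])

compressed = original.astype(np.float32)        # the lossy step
reconstructed = compressed.astype(np.float64)   # widening back is exact

abs_err = np.abs(reconstructed - original)
rel_err = abs_err / np.abs(original)            # no zeros in this sample

print("max absolute error:", abs_err.max())
print("max relative error:", rel_err.max())
```

Because float32 to float64 is an exact widening, all of the error comes from the first `astype` call.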

Other than looking at the maximum error for my actual data, does anybody have a good idea what type of distortions this creates, e.g. as a function of the magnitude of the original value?

Would I be better off first mapping all values (in 64 bits) onto, say, [-1, 1] (as a fraction of the extreme values, which could themselves be preserved in 64 bits) to take advantage of the greater density of floats near zero?

I'm adding a specific case I have in mind. Let's say I have 500k to 1e6 values ranging from -20 to 20 that are approximately IID ~ Normal(mu=0, sigma=4), so they're already pretty concentrated near zero and the "20" is ~5-sigma rare. Let's say they are scientific measurements where the true precision is a whole lot less than that of 64-bit floats, but it is hard to know exactly how much less. I have tons of separate instances (potentially TBs' worth), so compressing has a lot of practical value, and float32 is a quick way to get 50% (and, if anything, it works better with an additional round of lossless compression like gzip). So the "-20 to 20" eliminates a lot of concerns about really large values.
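To get a feel for the error as a function of magnitude in this specific case, one could simulate it. The sketch below assumes synthetic data drawn from Normal(0, 4) with a fixed seed as a stand-in for the real measurements, and groups the round-trip error by power-of-two interval.

```python
import numpy as np

rng = np.random.default_rng(42)           # synthetic stand-in for real data
x = rng.normal(0.0, 4.0, size=500_000)
x = x[np.abs(x) <= 20.0]                  # keep the stated -20..20 range

err = np.abs(x.astype(np.float32).astype(np.float64) - x)

# np.frexp gives |x| = m * 2**exp with 0.5 <= m < 1,
# so 2**(exp - 1) <= |x| < 2**exp.
_, exp = np.frexp(x)
binade = exp - 1

for e in range(-2, 5):
    mask = binade == e
    if mask.any():
        # Observed worst error next to the half-ULP bound 2**(e - 24).
        print(f"[2^{e}, 2^{e + 1}): max err {err[mask].max():.3e}, "
              f"bound {2.0 ** (e - 24):.3e}")
```

The absolute error grows in steps with the magnitude, while the error relative to the value stays roughly constant.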

3 Answers

The following assumes you are using standard IEEE-754 floating-point operations, which are common (with some exceptions), in the usual round-to-nearest mode.

If a double value is within the normal range of float values, then the only change that occurs when the double is rounded to a float is that the significand (the fraction portion of the value) is rounded from 53 bits to 24 bits. This causes an error of at most 1/2 ULP (unit of least precision). The ULP of a float is 2^-23 times the greatest power of two not greater than the float. E.g., if a float is 7.25, the greatest power of two not greater than it is 4, so its ULP is 4*2^-23 = 2^-21, about 4.77e-7. So the error when a double in the interval [4, 8) is converted to float is at most 2^-22, about 2.38e-7. For another example, if a float is about 0.03, the greatest power of two not greater than it is 2^-6, so the ULP is 2^-29, and the maximum error when converting to float is 2^-30.
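These ULP figures can be checked with NumPy's `np.spacing`, which returns the gap between a value and the next representable one of the same type. The random sample over [4, 8) is an assumption added here purely for illustration.

```python
import numpy as np

# ULP of float32 at 7.25 and at about 0.03
ulp_7_25 = float(np.spacing(np.float32(7.25)))
ulp_0_03 = float(np.spacing(np.float32(0.03)))
print(ulp_7_25, ulp_0_03)

# Empirical check: doubles in [4, 8) lose at most half an ULP (2**-22).
rng = np.random.default_rng(0)
doubles = rng.uniform(4.0, 8.0, size=100_000)
err = np.abs(doubles.astype(np.float32).astype(np.float64) - doubles)
print(err.max(), 2.0**-22)
```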

Those are absolute errors. The relative error is less than 2^-24, which is 1/2 ULP divided by the smallest the value could be (the smallest value in the interval for a particular ULP, i.e. the power of two that bounds it). E.g., for each number x in [4, 8), we know the number is at least 4 and the error is at most 2^-22, so the relative error is at most 2^-22 / 4 = 2^-24. (The error cannot be exactly 2^-24 because there is no error when converting an exact power of two from double to float, so there is an error only if x is greater than four, so the relative error is less than, not equal to, 2^-24.) When you know more about the value being converted, e.g., that it is nearer 8 than 4, you can bound the error more tightly.
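A small sketch of the two ends of the interval [4, 8). The two inputs are picked by hand to sit exactly halfway between neighbouring float32 values, which is the worst case for absolute error; they are illustrative choices, not values from the question.

```python
import numpy as np

def rel_err(x):
    """Relative error of rounding the Python float x to float32."""
    return abs(float(np.float32(x)) - x) / abs(x)

near_4 = 4.0 + 2.0**-22   # halfway between 4 and the next float32 above it
near_8 = 8.0 - 2.0**-22   # halfway between 8 and the float32 just below it

print(rel_err(near_4), rel_err(near_8), 2.0**-24)
```

Both inputs have the same absolute error, 2^-22, but the one near 8 has about half the relative error of the one near 4, and neither reaches 2^-24.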

If the number is outside the normal range of a float, errors can be larger. The maximum finite float value is 2^128 - 2^104, about 3.40e38. When you convert a double that exceeds that maximum by 1/2 ULP (of a float; doubles have a finer ULP) or more, infinity is returned, which is, of course, an infinite absolute error and an infinite relative error. (A double that is greater than the maximum finite float, but by less than 1/2 ULP, is converted to the maximum finite float and has the same errors discussed in the previous paragraph.)
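The overflow threshold can be probed directly. This sketch assumes an IEEE-754 platform with round-to-nearest conversion; `np.errstate` is used only to silence the overflow warning NumPy may emit during the cast.

```python
import numpy as np

fmax = float(np.finfo(np.float32).max)   # 2**128 - 2**104
half_ulp = 2.0**103                      # half the float32 ULP at the top of the range

threshold = fmax + half_ulp              # exactly representable as a double
just_below = np.nextafter(threshold, 0.0)

with np.errstate(over="ignore"):
    results = np.array([fmax, just_below, threshold]).astype(np.float32)

print(results)
```

The first two inputs should come back as the maximum finite float32 and the third as infinity.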

The minimum positive normal float is 2^-126, about 1.18e-38. Numbers within 1/2 ULP of this (inclusive) are converted to it, but numbers less than that are converted to a special denormalized format, where the ULP is fixed at 2^-149. The absolute error will be at most 1/2 ULP, 2^-150. The relative error will depend significantly on the value being converted.
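A minimal sketch of the denormal case, assuming the platform does not flush denormals to zero. The input `x` is an arbitrary value chosen to make the arithmetic easy to follow.

```python
import numpy as np

smallest_normal = float(np.finfo(np.float32).tiny)                  # 2**-126
denormal_ulp = float(np.nextafter(np.float32(0), np.float32(1)))    # 2**-149

x = 0.75 * 2.0**-149          # a double far below the normal float32 range
y = float(np.float32(x))      # rounds to the nearest multiple of 2**-149

abs_err = abs(y - x)
rel_err = abs_err / x
print(abs_err, rel_err)
```

Here the absolute error is tiny (2^-151) but the relative error is one third, many orders of magnitude worse than the 2^-24 bound in the normal range.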

The above discusses positive numbers. The errors for negative numbers are symmetric.

If the value of a double can be represented exactly as a float, there is no error in conversion.

Mapping the input numbers to a new interval can reduce errors in specific situations. As a contrived example, suppose all your numbers are integers in the interval [2^48, 2^48 + 2^24). Then converting them to float would lose all information that distinguishes the values; they would all be converted to 2^48. But mapping them to [0, 2^24) would preserve all information; each different input would be converted to a different result.
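The contrived example is easy to reproduce. To keep memory use small, this sketch takes every 4097th integer from the interval rather than all 2^24 of them; that stride is an arbitrary choice for illustration.

```python
import numpy as np

base = 2.0**48
offsets = np.arange(0, 2**24, 4097, dtype=np.float64)   # a sample of the interval
values = base + offsets                                 # exact in float64

direct = values.astype(np.float32)             # convert without mapping
shifted = (values - base).astype(np.float32)   # map to [0, 2**24) first

print("distinct values, direct :", np.unique(direct).size)
print("distinct values, shifted:", np.unique(shifted).size)
```

To reconstruct the originals, the offset `base` has to be stored separately in float64 and added back after widening.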

Which map would best suit your purposes depends on your specific situation.

It is unlikely that a simple transformation will reduce error significantly, since your distribution is centered around zero.

Scaling can have an effect in only two ways. One, it moves values away from the denormal interval of single-precision values, (-2^-126, 2^-126). (E.g., if you multiply by, say, 2^123, values that were in [2^-249, 2^-126) are mapped to [2^-126, 2^-3), which is outside the denormal interval.) Two, it changes where values lie in each "binade" (the interval from one power of two to the next). E.g., your maximum value is 20, where the relative error may be 1/2 ULP / 20, and the ULP for that binade is 16*2^-23 = 2^-19, so the relative error may be 1/2 * 2^-19 / 20, about 4.77e-8. Suppose you scale by 32/20, so values just under 20 become values just under 32. Then, when you convert to float, the relative error is at most 1/2 * 2^-19 / 32 (or just under 32), about 2.98e-8. So you may reduce the error slightly.
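The arithmetic for the top of the range can be checked numerically. In this sketch the sample of values just under 20 and the scale factor held as a double are assumptions for illustration; the tiny rounding introduced by the multiplication itself is ignored.

```python
import numpy as np

half_ulp = 0.5 * 2.0**-19          # half the float32 ULP in [16, 32)
bound_at_20 = half_ulp / 20        # about 4.77e-8
bound_at_32 = half_ulp / 32        # about 2.98e-8
print(bound_at_20, bound_at_32)

def max_rel_err(v):
    back = v.astype(np.float32).astype(np.float64)
    return (np.abs(back - v) / np.abs(v)).max()

x = np.linspace(19.9, 20.0, 10_001, endpoint=False)   # values just under 20
scaled = x * (32.0 / 20.0)                            # values just under 32

print(max_rel_err(x), max_rel_err(scaled))
```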

With regard to the former, if your values are nearly normally distributed, very few are in (-2^-126, 2^-126), simply because that interval is so small. (A trillion samples of your normal distribution almost certainly have no values in that interval.) You say these are scientific measurements, so perhaps they are produced with some instrument. It may be that the machine does not measure or calculate finely enough to return values that range from 2^-126 to 20, so it would not surprise me if you have no values in the denormal interval at all. If you have no values in the single-precision denormal range, then scaling to avoid that range is of no use.

With regard to the latter, we see that a small improvement is available at the end of your range. However, elsewhere in your range, some values are also moved to the high end of a binade, but some are moved across a binade boundary to the small end of a new binade, resulting in increased relative error for them. It is unlikely there is a significant net improvement.
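One way to test this net effect is to compare the mean relative error over the whole distribution with and without the 32/20 scaling. The sketch uses synthetic Normal(0, 4) data with a fixed seed, which is an assumption standing in for the real measurements.

```python
import numpy as np

rng = np.random.default_rng(7)            # synthetic stand-in for real data
x = rng.normal(0.0, 4.0, size=200_000)

def mean_rel_err(v):
    """Mean relative error of a float64 -> float32 -> float64 round trip."""
    back = v.astype(np.float32).astype(np.float64)
    return np.mean(np.abs(back - v) / np.abs(v))

plain = mean_rel_err(x)
scaled = mean_rel_err(x * (32.0 / 20.0))
ratio = scaled / plain

print(plain, scaled, ratio)
```

A ratio close to 1 would support the point that gains at the top of the range are offset by losses elsewhere.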

On the other hand, we do not know what is significant for your application. How much error can your application tolerate? Will the change in the ultimate result be unnoticeable if random noise of 1% is added to each number? Or will the result be completely unacceptable if a few numbers change by as little as 2^-200?

What do you know about the machinery producing these numbers? Is it truly producing numbers more precise than single-precision floats? Perhaps, although it produces 64-bit floating-point values, the actual values are limited to a population that is representable in 32-bit floating-point. Have you performed a conversion from double to float and measured the error?
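A quick way to check whether the data already fits in single precision is to count how many values survive the round trip unchanged. The helper name `fraction_exact_in_float32` and both test arrays are hypothetical, made up for this sketch.

```python
import numpy as np

def fraction_exact_in_float32(values):
    """Fraction of values that survive float64 -> float32 -> float64 unchanged."""
    values = np.asarray(values, dtype=np.float64)
    back = values.astype(np.float32).astype(np.float64)
    return np.mean(back == values)

rng = np.random.default_rng(1)
# Simulated instrument that only ever produced single-precision values.
from_instrument = rng.normal(0, 4, 10_000).astype(np.float32).astype(np.float64)
# Simulated data that genuinely uses the full double precision.
full_precision = rng.normal(0, 4, 10_000)

print(fraction_exact_in_float32(from_instrument))
print(fraction_exact_in_float32(full_precision))
```

A fraction of 1.0 on real data would mean the conversion to float32 loses nothing at all.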

There is still insufficient information to rule out these or other possibilities, but my best guess is that there is little to gain by any transformation. Converting to float will either introduce too much error or it will not, and transforming the numbers first is unlikely to alter that.

The exponent range of float32 is quite a lot smaller than that of float64 (the largest exponent is smaller, and the most negative exponent is less negative), but assuming all your numbers fall within that range, you only need to worry about the loss of precision. float32 is only good to roughly 7 significant decimal digits.
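NumPy's `np.finfo` reports these limits directly. The integer 2^24 + 1 below is a hand-picked example of the first whole number that float32 cannot hold.

```python
import numpy as np

info = np.finfo(np.float32)
print("significand bits:", info.nmant + 1)    # stored bits plus the implicit leading bit
print("decimal digits always preserved:", info.precision)
print("largest finite value:", info.max)

a = np.float32(16777216.0)   # 2**24, exactly representable
b = np.float32(16777217.0)   # 2**24 + 1 needs 25 significand bits
print(a == b)
```

With a 24-bit significand, about 7 decimal digits are available, and integers above 2^24 start to collide.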