
What should I worry about if I compress float64 array to float32 in numpy?

This is a particular kind of lossy compression that's quite easy to implement in numpy.

I could in principle directly compare the original (float64) to the reconstructed (float64(float32(original))) and know things like the maximum error.

Other than looking at the maximum error for my actual data, does anybody have a good idea what type of distortions this creates, eg as a function of the magnitude of the original value?

Would I be better off mapping all values (in 64-bits) onto say [-1,1] first (as a fraction of extreme values, which could be preserved in 64-bits) to take advantage of greater density of floats near zero?

I'm adding a specific case I have in mind. Let's say I have 500k to 1e6 values ranging from -20 to 20, that are approximately IID ~ Normal(mu=0,sigma=4) so they're already pretty concentrated near zero and the "20" is ~5-sigma rare. Let's say they are scientific measurements where the true precision is a whole lot less than the 64-bit floats, but hard to really know exactly. I have tons of separate instances (potentially TB's worth) so compressing has a lot of practical value, and float32 is a quick way to get 50% (and if anything, works better with an additional round of lossless compression like gzip). So the "-20 to 20" eliminates a lot of concerns about really large values.
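Not part of the original question, but a minimal numpy sketch of the direct float64 → float32 → float64 comparison described above; the random Normal(0, 4) array and its size are illustrative stand-ins for the real measurements:

    import numpy as np

    # Synthetic stand-in for the data described above: ~1e6 IID samples
    # from Normal(mu=0, sigma=4), stored as float64. (Illustrative only.)
    rng = np.random.default_rng(0)
    original = rng.normal(loc=0.0, scale=4.0, size=1_000_000)

    # Round-trip through float32 and compare to the original.
    reconstructed = original.astype(np.float32).astype(np.float64)
    abs_err = np.abs(reconstructed - original)
    rel_err = abs_err[original != 0] / np.abs(original[original != 0])

    print("max abs error:", abs_err.max())   # on the order of 1e-6 near |x| ~ 20
    print("max rel error:", rel_err.max())   # a bit under 2**-24 ~ 6e-8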

The following assumes standard IEEE-754 floating-point operations (which are common, with some exceptions) in the usual round-to-nearest mode.

If a double value is within the normal range of float values, then the only change that occurs when the double is rounded to a float is that the significand (fraction portion of the value) is rounded from 53 bits to 24 bits. This causes an error of at most 1/2 ULP (unit of least precision). The ULP of a float is 2^-23 times the greatest power of two not greater than the float. E.g., if a float is 7.25, the greatest power of two not greater than it is 4, so its ULP is 4*2^-23 = 2^-21, about 4.77e-7. So the error when a double in the interval [4, 8) is converted to float is at most 2^-22, about 2.38e-7. For another example, if a float is about .03, the greatest power of two not greater than it is 2^-6, so the ULP is 2^-29, and the maximum error when converting to float is 2^-30.
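A quick numpy check of these bounds (np.spacing gives the distance to the next representable value, i.e. the ULP, in the array's own precision); the sample grid over [4, 8) is just an illustration:

    import numpy as np

    print(np.spacing(np.float32(7.25)))   # ULP in the binade [4, 8): 2**-21 ≈ 4.77e-7

    # Doubles in [4, 8) rounded to float32 should be off by at most half that ULP.
    d = np.linspace(4.0, 8.0, 1_000_001, endpoint=False)          # float64
    err = np.abs(d.astype(np.float32).astype(np.float64) - d)
    print(err.max(), err.max() <= 2.0**-22)                        # ≤ 2**-22 ≈ 2.38e-7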

Those are absolute errors. The relative error is less than 2^-24, which is 1/2 ULP divided by the smallest the value could be (the smallest value in the interval for a particular ULP, i.e., the power of two at the bottom of that interval). E.g., for each number x in [4, 8), we know the number is at least 4 and the error is at most 2^-22, so the relative error is at most 2^-22/4 = 2^-24. (The error cannot be exactly 2^-24 because there is no error when converting an exact power of two from double to float, so there is an error only if x is greater than four, so the relative error is less than, not equal to, 2^-24.) When you know more about the value being converted, e.g., that it is nearer 8 than 4, you can bound the error more tightly.
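And the corresponding relative-error bound, again checked with illustrative uniform samples:

    import numpy as np

    rng = np.random.default_rng(1)
    d = rng.uniform(4.0, 8.0, size=1_000_000)              # doubles in [4, 8)
    f = d.astype(np.float32).astype(np.float64)
    rel = np.abs(f - d) / d
    print(rel.max(), rel.max() < 2.0**-24)                 # strictly below 2**-24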

If the number is outside the normal range of a float, errors can be larger. The maximum finite float value is 2^128 - 2^104, about 3.40e38. When you convert a double that is 1/2 ULP (of a float; doubles have a finer ULP) more than that or greater to float, infinity is returned, which is, of course, an infinite absolute error and an infinite relative error. (A double that is greater than the maximum finite float but is greater by less than 1/2 ULP is converted to the maximum finite float and has the same errors discussed in the previous paragraph.)

The minimum positive normal float is 2^-126, about 1.18e-38. Numbers within 1/2 ULP of this (inclusive) are converted to it, but numbers less than that are converted to a special denormalized format, where the ULP is fixed at 2^-149. The absolute error will be at most 1/2 ULP, 2^-150. The relative error will depend significantly on the value being converted.
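The overflow and underflow behaviour is easy to see directly (note that recent numpy versions may also emit an overflow warning for the cast):

    import numpy as np

    # Overflow: a double beyond the float32 range becomes inf.
    print(np.finfo(np.float32).max)                 # ≈ 3.4028235e+38
    print(np.float32(3.5e38))                       # inf

    # Underflow: a double below 2**-126 lands in the denormal (subnormal) range,
    # where the spacing between representable values is fixed at 2**-149.
    tiny = np.float64(2.0**-130)
    print(np.float32(tiny))                         # a subnormal float32
    print(np.spacing(np.float32(tiny)))             # 2**-149 ≈ 1.4e-45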

The above discusses positive numbers. The errors for negative numbers are symmetric.

If the value of a double can be represented exactly as a float, there is no error in conversion.

Mapping the input numbers to a new interval can reduce errors in specific situations. As a contrived example, suppose all your numbers are integers in the interval [2^48, 2^48 + 2^24). Then converting them to float would lose all information that distinguishes the values; they would all be converted to 2^48. But mapping them to [0, 2^24) would preserve all information; each different input would be converted to a different result.
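The contrived example is easy to reproduce; the step of 2^20 below is only there to keep the array small:

    import numpy as np

    # Integers in [2**48, 2**48 + 2**24): float32 cannot tell them apart...
    x = np.arange(2**48, 2**48 + 2**24, 2**20, dtype=np.float64)
    print(np.unique(x.astype(np.float32)).size)             # 1 -- all collapse to 2**48

    # ...but after shifting to [0, 2**24) every value survives the round trip.
    shifted = (x - 2**48).astype(np.float32)
    print(np.unique(shifted).size == x.size)                # True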

Which map would best suit your purposes depends on your specific situation.

It is unlikely that a simple transformation will reduce error significantly, since your distribution is centered around zero.

Scaling can have an effect in only two ways. One, it moves values away from the denormal interval of single-precision values, (-2^-126, 2^-126). (E.g., if you multiply by, say, 2^123, values that were in [2^-249, 2^-126) are mapped to [2^-126, 2^-3), which is outside the denormal interval.) Two, it changes where values lie in each “binade” (interval from one power of two to the next). E.g., your maximum value is 20, where the relative error may be 1/2 ULP / 20; the ULP for that binade is 16*2^-23 = 2^-19, so the relative error may be 1/2 * 2^-19 / 20, about 4.77e-8. Suppose you scale by 32/20, so values just under 20 become values just under 32. Then, when you convert to float, the relative error is at most 1/2 * 2^-19 / 32 (or just under 32), about 2.98e-8. So you may reduce the error slightly.
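A rough numerical illustration of the binade effect for values just under 20 (the uniform samples and the 32/20 factor are just for demonstration; the scaling and unscaling steps themselves add only float64-level error):

    import numpy as np

    rng = np.random.default_rng(2)
    d = rng.uniform(19.9, 20.0, size=1_000_000)    # doubles just under 20

    # Direct conversion: values sit near the low end of the binade [16, 32).
    rel_direct = np.abs(d.astype(np.float32).astype(np.float64) - d) / d

    # Scale toward the top of the binade, convert, then undo the scale in float64.
    scale = 32.0 / 20.0
    rel_scaled = np.abs((d * scale).astype(np.float32).astype(np.float64) / scale - d) / d

    print(rel_direct.max())    # close to 4.8e-8
    print(rel_scaled.max())    # close to 3.0e-8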

With regard to the former, if your values are nearly normally distributed, very few are in (-2^-126, 2^-126), simply because that interval is so small. (A trillion samples of your normal distribution almost certainly have no values in that interval.) You say these are scientific measurements, so perhaps they are produced with some instrument. It may be that the machine does not measure or calculate finely enough to return values that range from 2^-126 to 20, so it would not surprise me if you have no values in the denormal interval at all. If you have no values in the single-precision denormal range, then scaling to avoid that range is of no use.
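Checking whether any values actually fall in the single-precision denormal range is a one-liner (again with a synthetic Normal(0, 4) array standing in for the real data):

    import numpy as np

    rng = np.random.default_rng(3)
    values = rng.normal(0.0, 4.0, size=1_000_000)   # stand-in for the real measurements

    tiny = np.finfo(np.float32).tiny                # smallest positive normal float32, 2**-126
    in_denormal_range = (values != 0) & (np.abs(values) < tiny)
    print(in_denormal_range.sum())                  # almost certainly 0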

With regard to the latter, we see a small improvement is available at the end of your range. However, elsewhere in your range, some values are also moved to the high end of a binade, but some are moved across a binade boundary to the small end of a new binade, resulting in increased relative error for them. It is unlikely there is a significant net improvement.

On the other hand, we do not know what is significant for your application. How much error can your application tolerate? Will the change in the ultimate result be unnoticeable if random noise of 1% is added to each number? Or will the result be completely unacceptable if a few numbers change by as little as 2^-200?

What do you know about the machinery producing these numbers? Is it truly producing numbers more precise than single-precision floats? Perhaps, although it produces 64-bit floating-point values, the actual values are limited to a population that is representable in 32-bit floating-point. Have you performed a conversion from double to float and measured the error?
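One way to test that last guess is to check how many of the doubles survive a float32 round trip unchanged; if the fraction is 1.0, the 64-bit storage carries no extra information (the array below is again a synthetic placeholder):

    import numpy as np

    rng = np.random.default_rng(4)
    values = rng.normal(0.0, 4.0, size=1_000_000)   # replace with the real measurements

    round_trip = values.astype(np.float32).astype(np.float64)
    print(np.mean(round_trip == values))            # fraction exactly representable in float32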

There is still insufficient information to rule out these or other possibilities, but my best guess is that there is little to gain by any transformation. Converting to float will either introduce too much error or it will not, and transforming the numbers first is unlikely to alter that.

The exponent range of float32 is much smaller than float64's (the largest representable magnitude is far smaller, and the smallest normal magnitude is far larger), but assuming all your numbers fit well within that range, you only need to worry about the loss of precision: float32 is only good to about 7 significant decimal digits.
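For reference, numpy can report these limits directly (a minimal sketch):

    import numpy as np

    info = np.finfo(np.float32)
    print(info.nmant)        # 23 explicit significand bits (24 including the implicit bit)
    print(info.precision)    # 6 decimal digits guaranteed to survive a round trip
    print(info.max)          # ≈ 3.4028235e+38
    print(info.tiny)         # ≈ 1.1754944e-38, smallest positive normal value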
