Python/numpy floating-point text precision

Let's say I have some 32-bit and 64-bit floating point values:

>>> import numpy as np
>>> v32 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635], 
                   dtype=np.float32)
>>> v64 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635], 
                   dtype=np.float64)

I want to serialize these values to text without losing precision (or at least really close to not losing precision). I think the canonical way of doing this is with repr:

>>> map(repr, v32)
['5.0', '0.1', '2.4000001', '4.5555553', '12345679.0']
>>> map(repr, v64)
['5.0', '0.10000000000000001', '2.3999999999999999', '4.5555555555555554', 
 '12345678.923456786']

But I want to make the representation as compact as possible to minimize file size, so it would be nice if values like 2.4 got serialized without the extra decimals. Yes, I know that's their actual floating point representation, but %g seems to be able to take care of this:

>>> ('%.7g ' * len(v32)) % tuple(v32)
'5 0.1 2.4 4.555555 1.234568e+07 '
>>> ('%.16g ' * len(v64)) % tuple(v64)
'5 0.1 2.4 4.555555555555555 12345678.92345679 '

My question is: is it safe to use %g in this way? Are .7 and .16 the correct values so that precision won't be lost?

Python 2.7 and later already have a smart repr implementation for floats that prints 0.1 as 0.1. The brief output is chosen in preference to other candidates such as 0.10000000000000001 because it is the shortest representation of that particular number that roundtrips to the exact same floating-point value when read back into Python. To use this algorithm, convert your 64-bit floats to actual Python floats before handing them off to repr:

>>> map(repr, map(float, v64))
['5.0', '0.1', '2.4', '4.555555555555555', '12345678.923456786']

Surprisingly, the result is natural-looking and numerically correct. More info on the 2.7/3.1 repr can be found in What's New and a fascinating lecture by Mark Dickinson.

Unfortunately, this trick won't work for 32-bit floats, at least not without reimplementing the algorithm used by Python 2.7's repr.
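For what it's worth, newer NumPy releases (1.14 and later, as far as I know) ship their own shortest-round-trip formatter that respects the scalar's dtype, so the 32-bit case may not need a reimplementation there. A minimal sketch, assuming such a NumPy version; shortest_text is a hypothetical helper name, not something from the answer above:

```python
import numpy as np

v32 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635],
               dtype=np.float32)

def shortest_text(x):
    # unique=True asks for the fewest digits that still identify x among
    # values of its own dtype; trim='-' drops trailing zeros and a trailing '.'.
    return np.format_float_positional(x, unique=True, trim='-')

texts = [shortest_text(x) for x in v32]

# Parse the text back and narrow to float32 to check nothing was lost.
restored = np.array([np.float32(float(t)) for t in texts])
print(texts)
print(np.array_equal(restored, v32))
```

Positional notation gets long for very large or very small magnitudes; np.format_float_scientific takes the same unique argument if that matters for your data.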

To uniquely determine a single-precision (32-bit) floating point number in IEEE-754 format, it can be necessary to use 9 significant decimal digits (i.e. counted from the first nonzero digit, unless the value is 0), and 9 digits are always sufficient.

For double-precision (64-bit) floating point numbers, 17 (significant) decimal digits may be necessary and are always sufficient.

I'm not quite sure how the %g format is specified, but as far as I can tell its precision counts significant digits, so a leading 0 (as in 0.1) is not counted. The safe values for the precision would therefore be .9 and .17.
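The digit counts above can be spot-checked by formatting a value, parsing the text back, and comparing. This is only a sketch: roundtrips is a hypothetical helper, and the random sample is a sanity check, not a proof.

```python
import numpy as np

def roundtrips(x, fmt):
    """True if formatting x with fmt and parsing the text back gives x again."""
    text = fmt % float(x)              # widening to a Python float is exact
    return type(x)(float(text)) == x   # narrow back to the original dtype

# Values from the question for which the shorter precisions lose information.
print(roundtrips(np.float32(4.555555555555555), '%.7g'))      # False
print(roundtrips(np.float64(12345678.92345678635), '%.16g'))  # False

# Random bit patterns (NaN and inf filtered out) as a broader spot check.
rng = np.random.default_rng(0)
s32 = np.frombuffer(rng.bytes(4 * 2000), dtype=np.float32)
s64 = np.frombuffer(rng.bytes(8 * 2000), dtype=np.float64)
s32 = s32[np.isfinite(s32)]
s64 = s64[np.isfinite(s64)]
ok32 = all(roundtrips(x, '%.9g') for x in s32)
ok64 = all(roundtrips(x, '%.17g') for x in s64)
print(ok32, ok64)
```

So .7 and .16 from the question are not safe in general, even though they happen to look right for values like 2.4.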

If you want to minimise the file size, writing the byte representations would produce a much smaller file, so if you can do that, that's the way to go.
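A rough sketch of the binary route, in case it helps. It assumes the reader knows the dtype (and byte order) when using raw bytes; np.save stores that information in a small header instead.

```python
import io
import numpy as np

v32 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635],
               dtype=np.float32)

# Raw bytes: exactly 4 bytes per float32, but no dtype or shape information.
raw = v32.tobytes()
from_raw = np.frombuffer(raw, dtype=np.float32)

# .npy format: a short header (dtype, shape) followed by the same raw bytes.
buf = io.BytesIO()
np.save(buf, v32)
buf.seek(0)
from_npy = np.load(buf)

# Text with 9 significant digits, for a size comparison.
text = ' '.join('%.9g' % x for x in v32)
print(len(raw), len(text))
```

Both binary forms are exact by construction, so there is no precision question to answer at all.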

The C code that implements the fancy repr in 2.7 is mostly in Python/dtoa.c (with wrappers in Python/pystrtod.c and Objects/floatobject.c). In particular, look at _Py_dg_dtoa. It should be possible to borrow this code and modify it to work with float instead of double. Then you could wrap this up in an extension module, or just build it as a shared library (.so) and call it through ctypes.

Also, note that the source says the implementation is 'Inspired by "How to Print Floating-Point Numbers Accurately" by Guy L. Steele, Jr. and Jon L. White [Proc. ACM SIGPLAN '90, pp. 112-126].' So, you might be able to implement something less flexible and simpler yourself by reading that paper (and whichever of the modifications documented in the dtoa.c comments seem appropriate).

Finally, the code is a minor change to code posted by David Gay at AT&T, and used in a number of other libraries (NSPR, etc.), one of which might have a more accessible version.

But before doing any of that, make sure there really is a performance issue by trying a Python function and measuring whether it's too slow.

And if this really is a performance-critical area, you probably don't want to loop over the list and call repr (or your own fancy C function) in the first place; you probably want a function that converts a numpy array of floats or doubles to a string all at once. (Ideally you'd want to build that into numpy, of course.)
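Short of building something into numpy, np.savetxt already formats a whole array in one call with a %-style format, which may be enough. A minimal sketch, using in-memory buffers in place of real files:

```python
import io
import numpy as np

values = [5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635]
v32 = np.array(values, dtype=np.float32)
v64 = np.array(values, dtype=np.float64)

def to_text(arr, fmt):
    # savetxt applies fmt to every element and writes one value per line.
    buf = io.StringIO()
    np.savetxt(buf, arr, fmt=fmt)
    return buf.getvalue()

text32 = to_text(v32, '%.9g')    # 9 significant digits for float32
text64 = to_text(v64, '%.17g')   # 17 significant digits for float64

# Read the text back with the matching dtype and compare.
back32 = np.loadtxt(io.StringIO(text32), dtype=np.float32)
back64 = np.loadtxt(io.StringIO(text64), dtype=np.float64)
print(np.array_equal(back32, v32), np.array_equal(back64, v64))
```

Whether this is fast enough for large arrays is something you would have to measure.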

One last thought: you're looking for "at least really close to not losing precision". It's conceivable that just converting to double and using the repr is close enough for your purposes, and it's obviously much easier than anything else, so you should at least test it to rule it out.
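To make that option concrete, here is a small sketch of what "convert to double and use the repr" looks like for a float32 value. The text is exact for the widened double, so it round-trips, but it tends to be noticeably longer than the 9-digit form:

```python
import numpy as np

x = np.float32(2.4)

# Widening float32 -> Python float (a double) is exact, and repr then gives
# the shortest text identifying that double, not the original float32.
as_double = repr(float(x))
nine_digits = '%.9g' % float(x)

print(as_double, len(as_double))
print(nine_digits, len(nine_digits))

# Parsing the long form and narrowing again recovers the float32 value.
print(np.float32(float(as_double)) == x)
```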

Needless to say, you should also test whether %.9g and %.17g are close enough for your purposes, since that's the next easiest thing that could possibly work.
