
NumPy: Is it safe to store an int64 value in an np.array with dtype float64 and later convert it back to integer?

I am wondering whether I am causing problems by assigning and converting data types incorrectly to and from numpy arrays in Python 2.7.

What I am doing is reading 64-bit integer values from an hdf5 file into a numpy.zeros() array of type numpy.float64! Then I write these values to another hdf5 file, declaring them as 64-bit unsigned integers!

Two examples of the original values, which are actually ID numbers (so it is crucial that they do not change due to data type conversion):

12028545243
12004994169

Question 1: Will that unsigned integer in the second hdf5-file be the same as in the original?

I checked this with a small subsample, but I cannot verify that it is true for all of them (there are millions)!

Question 2: If I am reading the 64-bit value from the original file to the numpy-array with data type=float64 and then doing something like:

value=int(value)
value.astype(int64)

will that be exactly the original value or does it change due to the transformation?

Question 3: Will Python interpret the values as I assumed as (a), (b), (c), and (d)? Will there be an issue with formatting the values too, like using scientific notations 'e+10'? Or does Python recognise them as the same value (since it is only a different way to display them ...)?

 1.20285452e+10 == 12028545243.0 == 12028545243 == 12028545243
 1.20049942e+10 == 12004994169.0 == 12004994169 == 12004994169
 (a)             (b)              (c)            (d)   

(a) listed value printing one column of array named data:

print data[:,0] <type 'numpy.ndarray'>

(b) printing a single element in data

print data[0,0] <type 'numpy.float64'>

(c) after doing the conversion

print int(data[0,0]) <type int>

(d) same as (a) but using astype() to convert!

print data[:,0].astype(numpy.int64) <type 'numpy.ndarray'>

You may ask why I am not assigning an int64 type to the numpy array to be safe. Yes, I will do that, but there is data which is already stored wrongly and I need to know if I can still trust this data ...

I am using: Python2.7, Pythonbrew, Ubuntu 14.04 LTS 64-bit on Lenovo T410

Generally, it is NOT safe to store a 64-bit integer in a 64-bit float. You can easily see that, for example, by looking at:

import numpy as np
print(np.int64(2**63-1))
print(np.int64(np.float64(2**63-1)))

While the first will give you the correct result (9223372036854775807), the second suffers a round-off error: 2**63-1 is rounded up to 2**63 as a float64, which no longer fits into an int64, so converting back overflows (typically yielding -9223372036854775808).

To understand this you have to look at how these numbers are stored. While an integer is basically only storing its absolute value in binary (plus one bit used for the sign of the number) this does not hold for a floating point number.

A floating point number is stored in three parts: the sign bit, the significand (mantissa), and the exponent. The number is then given as sign times mantissa times 2^exponent. These three have to share the bits available (in your case 64). As specified in numpy's documentation, for a np.float64 52 bits are used for the significand and 11 bits for the exponent. Because normalized numbers carry an implicit leading 1 on top of the 52 stored bits, the effective precision is 53 bits. Therefore, only integers with a magnitude up to 2**53 are guaranteed to come back unchanged if you convert them to a np.float64 and back.
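To make the bit layout concrete, here is a small sketch that splits a double into its three fields using only the standard library. It relies on the assumption that Python's built-in float uses the same IEEE 754 binary64 format as np.float64 (true on common platforms); the helper name float64_fields is made up for this illustration.

```python
import math
import struct

def float64_fields(x):
    """Split a binary64 float into (sign, biased exponent, stored fraction)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52 explicitly stored bits
    return sign, exponent, fraction

sign, exponent, fraction = float64_fields(float(12028545243))

# Rebuild the value: put the implicit leading 1 in front of the stored
# fraction, then scale by the unbiased exponent.
significand = (1 << 52) | fraction
rebuilt = math.ldexp(significand, exponent - 1023 - 52)

print(sign, exponent - 1023, rebuilt)
```

For one of the example IDs from the question, the rebuilt value matches the original integer, because the ID needs far fewer than 53 significant bits.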

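So to answer your first and second question: No, you cannot be sure that the numbers are the same if any number in your data set has a magnitude above 2**53 (the 52 stored significand bits plus the implicit leading bit give 53 bits of precision). The two example IDs, at around 1.2e10, are far below that limit, so they survive the round trip unchanged.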
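A rough way to check this on real data, rather than on a subsample, is sketched below. The array names are placeholders and the two IDs are the ones from the question; the bound test assumes the stored floats originally came from integers. A float64 with a magnitude strictly below 2**53 can only have come from exactly that integer, so for data that was already stored as float64 the bound test is what tells you whether it can still be trusted.

```python
import numpy as np

# Placeholder data: the two example IDs from the question.
ids = np.array([12028545243, 12004994169], dtype=np.int64)

# Round trip int64 -> float64 -> int64 and compare element by element.
as_float = ids.astype(np.float64)
round_trip = as_float.astype(np.int64)
survived = np.array_equal(ids, round_trip)

# For data that only exists as float64: every value strictly below 2**53
# in magnitude is an exactly represented integer.
LIMIT = 2 ** 53
all_within_limit = bool(np.all(np.abs(as_float) < LIMIT))

# Counterexample just above the limit: 2**53 + 1 is rounded to 2**53.
big = np.array([2 ** 53 + 1], dtype=np.int64)
big_round_trip = big.astype(np.float64).astype(np.int64)
big_survived = np.array_equal(big, big_round_trip)

print(survived, all_within_limit, big_survived)
```

If all_within_limit comes out True for the whole float64 column, converting it with astype(np.int64) should give back the original IDs.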

Concerning your third question: The formatting is done only when printing the values. Internally the numbers carry no formatting, so all those values are considered equal as long as they have exactly the same value.
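The difference between the displayed text and the stored value can be seen in a short sketch. It uses a plain Python float as a stand-in for np.float64 (assumed to be the same binary64 format) and a "%.8e" format to imitate the eight-digit display from (a); the variable names are only illustrative.

```python
value = float(12028545243)          # stand-in for data[0, 0]

# An eight-digit scientific display, like column (a) in the question.
short_text = "%.8e" % value

# The stored value is untouched by how it is printed.
same_value = (value == 12028545243)

# repr() gives enough digits to recover the stored value exactly.
full_text = repr(value)
reparsed_full = float(full_text)

# Typing the shortened text back in, however, yields a different number.
reparsed_short = float(short_text)

print(short_text, full_text, same_value, reparsed_short == value)
```

So (a) to (d) refer to the same stored value, but the shortened text 1.20285452e+10 is only a display: entered as a literal it means 12028545200, not 12028545243.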

Btw, if you want to learn more about floating point arithmetic, a very good read is the paper "What every computer scientist should know about floating-point arithmetic" by David Goldberg.

It depends on whether NumPy converts your int64 values into float64 and then back into ints, or just stores the int data in the memory reserved for float64. I assume the first option is true. Even without inspecting the float64 internals (which is something one should do anyhow), it is clear that float64 cannot have a unique representation for all 2**64 different integers if it has only 2**64 different codes itself and needs some of them for 0.1 and so on as well. Float64 uses 52 bits to store a 53-bit normalized mantissa (the most significant bit is an implicit 1), so if your int has non-zero bits more than 52 bits after the leading one, as in:

     5764607523034234887
   = 0x5000000000000007
   = 0b0101000000000000000000000000000000000000000000000000000000000111

(which is a perfectly fine 64-bit integer)

the 0b111 part at the end will just get rounded away when the number is converted to double in order to fit it into the mantissa. This information is then lost forever. This can happen with your IDs if they are big enough numbers, so try adjusting your array to int64 instead.
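The loss for this particular value can be reproduced with a few lines. This is a sketch using Python's built-in float, assumed here to be the same 64-bit double that np.float64 uses.

```python
original = 5764607523034234887      # 0x5000000000000007

# Convert to double and back, as the float64 array would do.
as_double = float(original)
restored = int(as_double)

# The trailing 0b111 is gone: the low bits did not fit into the mantissa.
lost = original - restored

print(restored, lost)
```

The restored integer ends in ...880 instead of ...887, so the difference of 7 is exactly the 0b111 part.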
