
Python pandas large floats with to_csv

I am having a recurring problem with saving large numbers in Python to csv. The numbers are millisecond epoch time stamps, which I cannot convert or truncate and have to save in this format. As the columns with the millisecond timestamps also contain some NaN values, pandas casts them automatically to float (see the documentation in the Gotchas under "Support for integer NA").

I cannot seem to avoid this behaviour, so my question is: how can I save these numbers as integer values when using df.to_csv, i.e. with no decimal point or trailing zeros? I have columns with numbers of different floating precision in the same dataframe and I do not want to lose the information there. Using the float_format parameter in to_csv seems to apply the same format to ALL float columns in my dataframe.

An example:

>>> df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
>>> df['b'].dtype
Out[1]: dtype('int64')
>>> df.loc[2] = np.NaN
>>> df
Out[1]: 
       a             b
0   1.25  1.424380e+12
1   2.54  1.425511e+12
2    NaN           NaN
>>> df['b'].dtype
dtype('float64')
>>> df.to_csv('test.csv')
>>> with open ('test.csv') as f:
...     for line in f:
...         print(line)
,a,b
0,1.25,1.42438044944e+12
1,2.54,1.42551073119e+12
2,,

As you can see, I lost the precision of the last two digits of my epoch time stamp.
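
For example, float_format='%.0f' would restore the full timestamps in b, but it also rounds away the decimals in a:

>>> df.to_csv('test.csv', float_format='%.0f')
>>> with open('test.csv') as f:
...     print(f.read())
,a,b
0,1,1424380449437
1,3,1425510731187
2,,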

While df.to_csv does not have a parameter to change the format of individual columns, df.to_string does. It is a little cumbersome and might be a problem for very large DataFrames, but you can use it to produce a properly formatted string and then write that string to a file (as suggested in this answer to a similar question). to_string's formatters parameter takes, for example, a dictionary of functions to format individual columns. In your case, you could write your own custom formatter for the "b" column, leaving the defaults for the other column(s). This formatter might look somewhat like this:

def printInt(b):
    if pd.isnull(b):
        return "NaN"
    else:
        return "{:d}".format(int(b))

Now you can use this to produce your string:

df.to_string(formatters={"b": printInt}, na_rep="NaN")

which gives:

'      a             b\n0  1.25 1424380449437\n1  2.54 1425510731187\n2   NaN           NaN'

You can see that this is still not comma-separated, and to_string has no parameter to set a custom delimiter, but this can easily be fixed with a regex:

import re
re.sub("[ \t]+(NaN)?", ",",
       df.to_string(formatters={"b": printInt}, na_rep="NaN"))

gives:

',a,b\n0,1.25,1424380449437\n1,2.54,1425510731187\n2,,'

This can now be written into the file:

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+(NaN)?", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)

which results in what you wanted:

,a,b  
0,1.25,1424380449437  
1,2.54,1425510731187  
2,,  

If you want to keep the NaNs in the csv file, you can just change the regex:

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t]+", ",",
                 df.to_string(formatters={"b": printInt}, na_rep="NaN")),
          file=f)

will give:

,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN

If your DataFrame also contained strings with whitespace, a robust solution is not as easy. You could insert another character in front of every value that marks the start of the next entry. If all your strings contain at most single whitespaces, you could for example use a second whitespace. This would change the code to this:

import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'a a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
df.loc[2] = np.NaN

def printInt(b):
    if pd.isnull(b):
        return " NaN"
    else:
        return " {:d}".format(int(b))

def printFloat(a):
    if pd.isnull(a):
        return " NaN"
    else:
        return " {}".format(a)

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t][ \t]+", ",",
                 df.to_string(formatters={"a a": printFloat, "b": printInt},
                              na_rep="NaN", col_space=2)),
          file=f)

which would give:

,a a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN

Maybe this could work:

import pandas as pd
import numpy as np

pd.set_option('precision', 15)
df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
fg = df.applymap(lambda x: str(x))
fg.loc[2] = np.NaN
fg.to_csv('test.csv', na_rep='NaN')

Your output should be something like this (I'm on a mac):

[screenshot of the resulting test.csv]
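
The screenshot itself is not included here; based on the snippet above, test.csv should contain roughly:

,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN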

I have the same problem with large numbers; for Excel files, this is the right approach: df = "\t" + df
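
That trick only works once the cells are strings; here is a minimal sketch of the idea (the cell-to-string handling is my own addition, not from the comment):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
df.loc[2] = np.NaN

# The tab prefix can only be prepended to strings, so render every cell as text
# first: integer-valued floats (the timestamps) as plain integers, NaN as "".
as_text = df.applymap(
    lambda x: "" if pd.isnull(x)
    else "{:d}".format(int(x)) if float(x).is_integer()
    else str(x))

# Prefixing each cell with a tab makes Excel treat it as text, so the large
# numbers are not displayed in scientific notation.
("\t" + as_text).to_csv('test_excel.csv')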
