
pd.read_csv() incorrectly truncating timestamps formatted as scientific notation in Excel

I have a dataset in csv format that automatically downloads from a webservice. The csv file has the following general format:

csv file in excel
[Timestamp]  [Column B]
1.51258E+12  A
1.51242E+12  B
1.51242E+12  C

When the ['Timestamp'] formatting is changed in excel from 'General' to 'Number', the full number shows as follows:

csv file (formatting changed in excel)
[Timestamp]   [Column B]
1512584017891  A
1512423886571  B
1512423818970  C

I need to automate the processing of the csv file, so I cannot open it in Excel every time to switch the format from 'General' to 'Number'. What I'm finding is that pd.read_csv() imports the ['Timestamp'] csv column in scientific notation, leaving a truncated df['Timestamp'] with dtype=float64.

df (in pandas)
[Timestamp]  [Column B]
1.512580e+12  A
1.512420e+12  B
1.512420e+12  C

Notice that df['Timestamp'] now shows an extra 0 before the e+12 after importing. I tried converting with df['Timestamp'].astype('int64'), but this confirmed what I was worried about: pd.read_csv() replaced the hidden digits with zeros.

In[1]: df['Timestamp'].astype('int64').head(3)

Out[1]: 1512580000000
        1512420000000
        1512420000000
        Name: Timestamp, dtype: int64

Is there a way to 1) import the right timestamp, and then 2) convert that timestamp to the following format: 12/14/2017 10:32:12 AM?

You can specify the data type of each column with the optional dtype parameter of pd.read_csv. This avoids the data loss that comes from accepting pandas' default interpretation and converting only after the data has already been read in:

import pandas as pd
df = pd.read_csv('fname.csv', dtype={'Timestamp': 'int64'})
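
As a minimal sketch of how this could cover both parts of the question, assuming the raw file really does contain the full 13-digit values. The inline sample data, the `When` and `Formatted` column names, and the reading of the values as milliseconds since the Unix epoch (UTC) are all illustrative assumptions:

```python
import io
import pandas as pd

# Inline stand-in for the downloaded file (assumed to hold all 13 digits).
raw = "Timestamp,Column B\n1512584017891,A\n1512423886571,B\n1512423818970,C\n"

# Read the column as 64-bit integers so no digits are lost.
df = pd.read_csv(io.StringIO(raw), dtype={"Timestamp": "int64"})

# Assumption: the values are milliseconds since the Unix epoch (UTC).
df["When"] = pd.to_datetime(df["Timestamp"], unit="ms")

# Format as MM/DD/YYYY hh:mm:ss AM/PM.
df["Formatted"] = df["When"].dt.strftime("%m/%d/%Y %I:%M:%S %p")
print(df[["Timestamp", "Formatted"]])
```

If the file on disk already contains text such as 1.51258E+12, the missing digits are gone from the file itself and no dtype setting can recover them.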

In the answer below, I use pandas.to_datetime to convert the epoch time to a datetime.
I read the data from the csv like this:

import pandas as pd
df = pd.read_csv(path) 
print(df) 

      Timestamp
0  1.512580e+12
1  1.512420e+12
2  1.512420e+12

df.Timestamp = pd.to_datetime(df['Timestamp'], unit='ms')
print(df)

            Timestamp
0 2017-12-06 17:06:40
1 2017-12-04 20:40:00
2 2017-12-04 20:40:00


df.applymap(type)


Timestamp
0   <class 'pandas._libs.tslib.Timestamp'>
1   <class 'pandas._libs.tslib.Timestamp'>
2   <class 'pandas._libs.tslib.Timestamp'>
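
The times in the output above are rounded because the values were already truncated before conversion. A small sketch, using made-up inline data, of how one might check whether the rounding is present in the file text itself rather than introduced by pandas (the two sample strings are assumptions about what the file could contain):

```python
import io
import pandas as pd

# Hypothetical file contents: one with the full digits, one as the file
# might look after being re-saved with scientific notation in the cell.
full_text = "Timestamp\n1512584017891\n"
rounded_text = "Timestamp\n1.51258E+12\n"

full = pd.read_csv(io.StringIO(full_text))
rounded = pd.read_csv(io.StringIO(rounded_text))

print(full["Timestamp"].dtype)     # integer column, all digits kept
print(rounded["Timestamp"].dtype)  # float column, digits already missing

# Same conversion, different inputs: only the first keeps the real time.
print(pd.to_datetime(full["Timestamp"], unit="ms")[0])
print(pd.to_datetime(rounded["Timestamp"], unit="ms")[0])
```

If the real file behaves like the second case, the fix belongs upstream of pandas, for example by not re-saving the download from Excel.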

There might be a way to get pandas to read in your data properly, but I don't know pandas well enough to say how.

What I do know is that Python gives you the tools to take control of the reading and the critical parts of the data conversion yourself (so that you're not at the mercy of implicit and possibly lossy conversions performed by pandas).

In the comments, you said the raw, downloaded CSV contains all the timestamp digits when viewed in a text editor. So, let's say that raw data looks like this:

1512584017891,A
1512423886571,B
1512423818970,C

You can read the data with plain Python as follows:

with open('myfile.csv') as f:
    for line in f:
        print(line.strip().split(','))

(If the raw CSV is larger or more complex or has the possibility of "troublesome" characters, such as commas that are part of the data and not merely delimiters, then you'll want to use the csv module instead of simply splitting on all commas.)

The above produces

['1512584017891', 'A']
['1512423886571', 'B']
['1512423818970', 'C']

So you see, you have all the digits. You can losslessly convert those digits to Python integers (which have arbitrary precision) with the built-in int function, or to Python floats (IEEE doubles) with the built-in float function. For example, if we start again from the raw CSV input:

with open('myfile.csv') as f:
    for line in f:
        tokens = line.strip().split(',')
        ms = int(tokens[0])  # my guess is you have milliseconds
        label = tokens[1]
        print([ms, label])

That prints out

[1512584017891, 'A']
[1512423886571, 'B']
[1512423818970, 'C']

Do you see where I'm going with this? Maybe this is an appropriate place to pass the data along to pandas, maybe not. You can take it further with plain Python and hold off on giving control to pandas:

import time

with open('myfile.csv') as f:
    for line in f:
        tokens = line.strip().split(',')
        secs = int(tokens[0]) / 1000  # milliseconds to seconds
        label = tokens[1]
        print([time.ctime(secs), label])

The above produces

['Wed Dec  6 13:13:37 2017', 'A']
['Mon Dec  4 16:44:46 2017', 'B']
['Mon Dec  4 16:43:38 2017', 'C']

Note that the output of time.ctime is a formatted string in your machine's local time zone, and that it drops the fractions of a second. If you want a proper Python "timestamp" (which preserves down to microseconds, if available), it is better to use datetime:

from datetime import datetime

with open('myfile.csv') as f:
    for line in f:
        tokens = line.strip().split(',')
        secs = int(tokens[0]) / 1000  # milliseconds to seconds
        label = tokens[1]
        print([datetime.fromtimestamp(secs), label])

which produces

[datetime.datetime(2017, 12, 6, 13, 13, 37, 891000), 'A']
[datetime.datetime(2017, 12, 4, 16, 44, 46, 571000), 'B']
[datetime.datetime(2017, 12, 4, 16, 43, 38, 970000), 'C']

Once you have a proper datetime object, you can do a number of things with it, including producing a string formatted according to your own spec, or doing calculations with it. It might also be safe to pass datetime objects to pandas; I don't know.
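
A rough sketch of both ideas, assuming the values are milliseconds since the Unix epoch and that UTC is the wanted time zone. The helper `ms_to_datetime` and the column names are made up for illustration. It uses integer arithmetic via timedelta, and it also hands the resulting datetime objects to pandas:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_datetime(ms):
    """Convert integer milliseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

stamps = [1512584017891, 1512423886571, 1512423818970]
moments = [ms_to_datetime(ms) for ms in stamps]

# A string in the MM/DD/YYYY hh:mm:ss AM/PM layout asked for in the question.
formatted = [m.strftime("%m/%d/%Y %I:%M:%S %p") for m in moments]
print(formatted[0])

# pandas accepts a list of datetime objects and stores them as a datetime column.
df = pd.DataFrame({"When": moments, "Label": ["A", "B", "C"]})
print(df.dtypes)
```

Because the datetimes here are explicitly UTC, the printed hours will differ from the local-time output shown earlier unless your machine is set to UTC.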

The point is, wherever it is that pandas is failing you, you have the option of handling it yourself with just Python and its standard library.
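
For instance, here is a sketch of the csv-module variant mentioned earlier, using only the standard library. The inline data, including the quoted label with an embedded comma, is invented to show why splitting on every comma would not be enough:

```python
import csv
import io

# Inline stand-in for the raw file; the quoted label contains a comma.
raw = '1512584017891,A\n1512423886571,"B, with a comma"\n1512423818970,C\n'

rows = []
for tokens in csv.reader(io.StringIO(raw)):
    ms = int(tokens[0])  # exact: Python integers have arbitrary precision
    label = tokens[1]    # csv.reader has already removed the quotes
    rows.append((ms, label))

print(rows)
```

With a real file, one would pass the object returned by open('myfile.csv', newline='') to csv.reader instead of the StringIO stand-in.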

Finally, since you said you ultimately want to wind up with another CSV as your output: I think it's worth mentioning that if that CSV is meant to be opened by a human being using Excel (or LibreOffice or whatever), then consider doing them a favor and outputting directly to a .xlsx file instead. For that, you can once again use pandas, or "lower-level" packages like XlsxWriter. (This isn't very low-level, but it's lower-level than pandas. In fact, it's used by pandas, but you can use it directly for more control and richer functionality.)
