How do I prevent pandas.to_datetime() from converting 0001-01-01 to 2001-01-01?
I have read-only access to a database that I query and read into a Pandas dataframe using pymssql. One of the variables contains dates, some of which are stored as midnight on 01 Jan 0001 (i.e. 0001-01-01 00:00:00.0000000). I've no idea why those dates are included; as far as I know, SQL Server does not recognise them as valid dates, and they are probably the result of some default data entry. Nevertheless, that's what I have to work with. This can be recreated as a dataframe as follows:
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                       'date': ['0001-01-01 00:00:00.0000000',
                                '2015-05-22 00:00:00.0000000',
                                '0001-01-01 00:00:00.0000000',
                                '2015-05-06 00:00:00.0000000',
                                '2015-05-03 00:00:00.0000000']})
The dataframe looks like:
print(tempDF)
date id
0 0001-01-01 00:00:00.0000000 0
1 2015-05-22 00:00:00.0000000 1
2 0001-01-01 00:00:00.0000000 2
3 2015-05-06 00:00:00.0000000 3
4 2015-05-03 00:00:00.0000000 4
... with the following dtypes:
print(tempDF.dtypes)
date object
id int64
dtype: object
I routinely convert date fields in the dataframe to datetime format using:
tempDF['date'] = pd.to_datetime(tempDF['date'])
However, by chance, I've noticed that the 0001-01-01 date is converted to 2001-01-01:
print(tempDF)
date id
0 2001-01-01 0
1 2015-05-22 1
2 2001-01-01 2
3 2015-05-06 3
4 2015-05-03 4
I realise that the dates in the original database are incorrect because SQL Server doesn't see 0001-01-01 as a valid date. But at least in the 0001-01-01 format, such missing data are easy to identify within my Pandas dataframe. However, when pandas.to_datetime() changes these dates so that they lie within a feasible range, it is very easy to miss such outliers.
How can I make sure that pd.to_datetime doesn't interpret the outlier dates incorrectly?
If you provide a format, these dates will not be recognized:
In [92]: pd.to_datetime(tempDF['date'], format="%Y-%m-%d %H:%M:%S.%f", errors='coerce')
Out[92]:
0 NaT
1 2015-05-22
2 NaT
3 2015-05-06
4 2015-05-03
Name: date, dtype: datetime64[ns]
By default this raises an error, but if you pass errors='coerce', the unparseable dates are converted to NaT values (use coerce=True in older pandas versions).
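Since the goal is to keep the placeholder rows easy to spot, one option is to record them in a separate column before converting. The following is a minimal sketch under stated assumptions, not something from the original answer: the names `temp_df` and `date_missing` are made up for illustration, and it assumes the placeholder is always written with the `0001-01-01` prefix.

```python
import pandas as pd

temp_df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "date": ["0001-01-01 00:00:00.0000000",
             "2015-05-22 00:00:00.0000000",
             "0001-01-01 00:00:00.0000000",
             "2015-05-06 00:00:00.0000000",
             "2015-05-03 00:00:00.0000000"],
})

# Flag the placeholder rows while the column still holds the raw strings.
# "date_missing" is a hypothetical column name chosen for this sketch.
temp_df["date_missing"] = temp_df["date"].str.startswith("0001-01-01")

# Convert with an explicit format; values that cannot be represented
# are coerced to NaT instead of being silently shifted to another year.
temp_df["date"] = pd.to_datetime(
    temp_df["date"], format="%Y-%m-%d %H:%M:%S.%f", errors="coerce"
)

print(temp_df)
```

Because the mask is built from the raw strings, it does not depend on how the parser treats out-of-range years, so the placeholder rows stay identifiable after the conversion.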
The reason pandas converts these "0001-01-01" dates to "2001-01-01" when no format is provided is that this is the behaviour of dateutil:
In [32]: import dateutil.parser
In [33]: dateutil.parser.parse("0001-01-01")
Out[33]: datetime.datetime(2001, 1, 1, 0, 0)
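As a rough illustration of why an explicit format leads to NaT rather than a shifted year: with a format string there is no guessing, so the year stays 1, and that year lies outside the range a nanosecond-resolution pandas Timestamp can hold. The sketch below uses the standard library's `datetime.strptime` as a stand-in for format-based parsing; the variable names are illustrative, and the year comparison is only an approximate range check. The dateutil output shown above may also differ in other dateutil versions.

```python
from datetime import datetime

import pandas as pd

raw = "0001-01-01"

# With an explicit format, the four-digit year is taken literally.
parsed = datetime.strptime(raw, "%Y-%m-%d")
print(parsed.year)

# Bounds of a nanosecond-resolution pandas Timestamp.
lower, upper = pd.Timestamp.min, pd.Timestamp.max
print(lower, upper)

# Approximate check by year only: year 1 falls far below the lower bound,
# so the value cannot be stored as datetime64[ns].
fits_in_ns = lower.year <= parsed.year <= upper.year
print(fits_in_ns)
```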