I read in a CSV file containing dates. Some dates may be formatted incorrectly, and I want to find those. With the following approach I would expect the 2nd row to be NaT, but pandas seems to ignore the specified format regardless of whether I set infer_datetime_format or exact.
import pandas as pd
from io import StringIO
DATA = StringIO("""date
2019 10 07
2018 10
""")
df = pd.read_csv(DATA)
df['date'] = pd.to_datetime(df['date'], format="%Y %m %d", errors='coerce', exact=True)
results in
date
0 2019-10-07
1 2018-10-01
The pandas.to_datetime documentation refers to strftime() and strptime() Behavior but when I test it with plain Python it works:
import datetime
datetime.datetime.strptime(' 2018 10', '%Y %m %d')
I get the expected value error:
ValueError: time data ' 2018 10' does not match format '%Y %m %d'
What am I missing?
FYI: The question "pandas to_datetime not working" seems to be related but is different, and it appears to be fixed by now: it works with my pandas version 0.25.2.
This is a known bug; see GitHub for details.
Since we needed a solution I came up with the following workaround. Please note that in my question I used read_csv to keep the reproducible code snippet small and simple. We actually use read_fwf, and here is some sample data (time.txt):
2019 10 07 + 14:45 15:00 # Foo
2019 10 07 + 18:00 18:30 # Bar
2019 10 09 + 13:00 13:45 # Wrong indentation
I felt stating the row number is also a good idea so I added a little bit more voodoo:
import datetime
import io

import pandas as pd


class FileSanitizer(io.TextIOBase):
    """File-like wrapper that validates the date slice of each line as it is read."""
    row = 0
    date_range = None

    def __init__(self, iterable, date_range):
        self.iterable = iterable
        self.date_range = date_range

    def readline(self):
        result = next(self.iterable)
        self.row += 1
        try:
            datetime.datetime.strptime(result[self.date_range[0]:self.date_range[1]], "%Y %m %d")
        except ValueError as excep:
            # Prefix the row number and chain the original exception.
            raise ValueError(f'row: {self.row} => {excep}') from excep
        return result


filepath = 'time.txt'
colspecs = [[0, 10], [13, 18], [19, 25], [26, None]]
names = ['date', 'start', 'end', 'description']
with open(filepath, 'r') as file:
    df = pd.read_fwf(FileSanitizer(file, colspecs[0]),
                     colspecs=colspecs,
                     names=names,
                     )
The solution is based on the answer to "How to skip blank lines with read_fwf in pandas?". Please note that this will not work with read_csv.
Now I get the following error as expected:
ValueError: row: 3 => time data ' 2019 10 ' does not match format '%Y %m %d'
If anyone has a more sophisticated answer I'm happy to learn.
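One possible variation, as a rough sketch only: instead of raising on the first bad row, check every line up front with plain strptime and collect all offending row numbers. The helper name find_bad_date_rows and the in-memory sample list are made up for illustration, and the leading space on the third sample line is an assumption about how the misindented row looks in time.txt.

```python
import datetime


def find_bad_date_rows(lines, date_range=(0, 10), fmt="%Y %m %d"):
    """Return (row_number, date_text) pairs for lines whose date slice fails fmt.

    Hypothetical helper, not part of pandas: it only uses datetime.strptime,
    which requires the whole slice to match the format.
    """
    bad_rows = []
    for row, line in enumerate(lines, start=1):
        date_text = line[date_range[0]:date_range[1]]
        try:
            datetime.datetime.strptime(date_text, fmt)
        except ValueError:
            bad_rows.append((row, date_text))
    return bad_rows


# Sample lines modelled on time.txt; the leading space on the third line
# is assumed, to mimic the wrongly indented row.
sample = [
    "2019 10 07 + 14:45 15:00 # Foo",
    "2019 10 07 + 18:00 18:30 # Bar",
    " 2019 10 09 + 13:00 13:45 # Wrong indentation",
]
bad = find_bad_date_rows(sample)
print([row for row, _ in bad])
```

With a real file the same function could be fed the open file object, since iterating over it yields one line at a time. The trade-off is that the file is read twice: once for the check and once by read_fwf.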
There is an issue discussing this same aspect of pd.to_datetime with regard to exact string matching.
The thing is that if format is specified and exact is set to True, it is a .match-like search, meaning the format only has to match at the beginning of the string (as opposed to anywhere). So even though a given date is missing a day, it is still considered a valid match.
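If a full-string check is what is needed, one option is to hand each value to datetime.strptime, which rejects strings that do not match the whole format, and fall back to NaT on failure. This is only a sketch under that assumption; parse_strict is a hypothetical helper name, not a pandas function, and it will be slower than a vectorised pd.to_datetime call on large columns.

```python
import datetime

import pandas as pd


def parse_strict(value, fmt="%Y %m %d"):
    """Return a Timestamp if value matches fmt in full, otherwise NaT.

    Hypothetical helper: relies on datetime.strptime for the strict check.
    """
    try:
        return pd.Timestamp(datetime.datetime.strptime(value, fmt))
    except (ValueError, TypeError):
        # ValueError: the string does not match; TypeError: not a string (e.g. NaN).
        return pd.NaT


dates = pd.Series(["2019 10 07", "2018 10"])
parsed = dates.map(parse_strict)
print(parsed)
```

The rows that come back as NaT can then be located with parsed.isna().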