Pandas to_datetime raises no error on a wrong format

Question

I read in a CSV file containing dates. Some dates may be formatted wrongly and I want to find those. With the following approach I would expect the 2nd row to be NaT. But pandas seems to ignore the specified format, no matter whether I set infer_datetime_format or exact.

import pandas as pd
from io import StringIO

DATA = StringIO("""date
2019 10 07
   2018 10
""")
df = pd.read_csv(DATA)

df['date'] = pd.to_datetime(df['date'], format="%Y %m %d", errors='coerce', exact=True)

results in

        date
0 2019-10-07
1 2018-10-01

The pandas.to_datetime documentation refers to "strftime() and strptime() Behavior", but when I test the same format with plain Python it behaves as expected:

import datetime

datetime.datetime.strptime('  2018 10', '%Y %m %d')

I get the expected ValueError:

ValueError: time data '  2018 10' does not match format '%Y %m %d'

What am I missing?

FYI: The question "pandas to_datetime not working" seems to be related but is different, and it seems to be fixed by now. It works with my pandas version 0.25.2.

Answer 1

This is a known bug, see GitHub for details.

Since we needed a solution, I came up with the following workaround. Please note that in my question I used read_csv to keep the reproducible code snippet small and simple. We actually use read_fwf, and here is some sample data (time.txt):

2019 10 07 + 14:45 15:00  # Foo
2019 10 07 + 18:00 18:30  # Bar
  2019 10 09 + 13:00 13:45  # Wrong indentation

I also felt that reporting the row number was a good idea, so I added a little more voodoo:

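import datetime
import io

import pandas as pd


class FileSanitizer(io.TextIOBase):
    """Wrap a line iterator and validate the date slice of every line read."""
    row = 0
    date_range = None

    def __init__(self, iterable, date_range):
        self.iterable = iterable
        self.date_range = date_range  # [start, stop] character positions of the date

    def readline(self):
        result = next(self.iterable)
        self.row += 1
        try:
            # strptime is strict: a shifted or incomplete date raises ValueError
            datetime.datetime.strptime(result[self.date_range[0]:self.date_range[1]], "%Y %m %d")
        except ValueError as excep:
            # Prefix the message with the row number and chain the original exception
            raise ValueError(f'row: {self.row} => {excep}') from excep
        return result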


filepath = 'time.txt'
colspecs = [[0, 10], [13, 18], [19, 25], [26, None]]
names = ['date', 'start', 'end', 'description']

with open(filepath, 'r') as file:
    df = pd.read_fwf(FileSanitizer(file, colspecs[0]),
                     colspecs=colspecs,
                     names=names,
                     )

The solution is based on an answer to "How to skip blank lines with read_fwf in pandas?". Please note this will not work with read_csv.

Now I get the following error, as expected:

ValueError: row: 3 => time data '  2019 10 ' does not match format '%Y %m %d'
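
The same check can also be sketched as a plain pre-scan that collects every offending row instead of stopping at the first one. This is only a minimal sketch under assumptions: the helper name find_bad_dates and the in-memory SAMPLE are hypothetical, and the slice positions 0 to 10 are taken from colspecs[0] above.

```python
import datetime
import io

DATE_FORMAT = "%Y %m %d"


def find_bad_dates(lines, start, stop, fmt=DATE_FORMAT):
    """Return (row_number, raw_text, message) for each line whose date slice fails strptime.

    Hypothetical helper for illustration; row numbers are 1-based.
    """
    problems = []
    for row, line in enumerate(lines, start=1):
        raw = line[start:stop]
        try:
            datetime.datetime.strptime(raw, fmt)
        except ValueError as exc:
            problems.append((row, raw, str(exc)))
    return problems


# In-memory stand-in for time.txt, using the sample rows shown earlier
SAMPLE = io.StringIO(
    "2019 10 07 + 14:45 15:00  # Foo\n"
    "2019 10 07 + 18:00 18:30  # Bar\n"
    "  2019 10 09 + 13:00 13:45  # Wrong indentation\n"
)

bad_rows = find_bad_dates(SAMPLE, 0, 10)
for row, raw, message in bad_rows:
    print(f"row {row}: {message}")
```

Because this runs before any pandas call, it should not depend on how to_datetime treats the format; the file can then be read with read_fwf or read_csv once the list of problems is empty.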

If anyone has a more sophisticated answer, I'm happy to learn.

Answer 2

There is an issue discussing this same aspect of pd.to_datetime with regard to exact string matching.

The thing is that if format is specified and exact is set to True, it is a .match-like search, meaning the match must start at the beginning of the string (as opposed to anywhere in it). So even though a given date is missing the day, it is still a valid match.
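
To make the ".match-like" wording concrete, here is a small sketch of the matching modes in Python's re module. It is an analogy only: the YEAR_MONTH pattern and the sample strings are made up for illustration, and this is not how pandas parses dates internally.

```python
import re

# Hypothetical pattern: a four-digit year, a space, a two-digit month
YEAR_MONTH = re.compile(r"\d{4} \d{2}")

at_start = "2018 10 and more text"
in_middle = "text 2018 10"

# match() is anchored at position 0 but does not require consuming the whole string
match_at_start = YEAR_MONTH.match(at_start)
match_in_middle = YEAR_MONTH.match(in_middle)

# search() accepts a hit anywhere in the string
search_in_middle = YEAR_MONTH.search(in_middle)

# fullmatch() requires the entire string to be consumed
full_at_start = YEAR_MONTH.fullmatch(at_start)

print(bool(match_at_start), bool(match_in_middle),
      bool(search_in_middle), bool(full_at_start))
```

A check that behaves like match() is looser than one that behaves like fullmatch(), which is the kind of gap the answer describes.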

Answer 1 (accepted): score 1, 2019-11-06 09:21:43

Answer 2: score 0, 2019-11-04 13:38:58
