Skipping rows with bad dates when using pd.read_csv

Question

I'm reading in CSV files from an external data source using pd.read_csv, as in the code below:

pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=np.datetime64,
)

However, somewhere in the CSV that's being sent, there is a misformatted date, resulting in the following error:

ValueError: Error parsing datetime string "2015-08-2" at position 8

This causes the entire application to crash. Of course, I can handle this case with a try/except, but then I will lose all the other data in that particular CSV. I need pandas to keep and parse that other data.

I have no way of predicting when or where this data (which changes daily) will have badly formatted dates. Is there some way to get pd.read_csv to skip only the rows with bad dates but still parse all the other rows in the CSV?

Answer 1

"somewhere in the csv that's being sent, there is a misformatted date"

np.datetime64 needs ISO 8601 formatted strings to work properly. The good news is that you can wrap np.datetime64 in your own function and use this as the date_parser:

def parse_date(v):
    try:
        return np.datetime64(v)
    except ValueError:
        # apply whatever remedies you deem appropriate
        pass
    # fall back to the unparsed value
    return v

pd.read_csv(
    ...
    date_parser=parse_date
)

"I need pandas to keep and parse that other data."

I often find that a more flexible date parser like dateutil works better than np.datetime64 and may even work without the extra function:

import dateutil.parser  # import the submodule explicitly; a bare "import dateutil" may not load it

pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=dateutil.parser.parse,
)
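
As one possible reading of "apply whatever remedies you deem appropriate", the sketch below returns pd.NaT for any value that dateutil cannot parse, so the bad rows can be dropped afterwards. The helper name parse_or_nat and the inline sample data are made up for illustration. It applies the parser after reading rather than through date_parser, because date_parser is deprecated as of pandas 2.0.

```python
from io import StringIO

import pandas as pd
from dateutil import parser as dateutil_parser


def parse_or_nat(value):
    """Parse a single date string; return pd.NaT if it cannot be parsed."""
    try:
        return pd.Timestamp(dateutil_parser.parse(value))
    except (ValueError, OverflowError, TypeError):
        # dateutil raises a ValueError subclass for unknown formats,
        # and TypeError for non-string input such as a missing value
        return pd.NaT


# Hypothetical sample: one loosely formatted date, one garbage value
raw = "dates,data\n2015-12-01,0.9\n2015-08-2,0.8\nQ%Bte0tvk5,0.5\n"

df = pd.read_csv(StringIO(raw))  # read dates as plain strings first
df["dates"] = pd.to_datetime(df["dates"].map(parse_or_nat))

# Keep only the rows whose date was parsed
clean = df[df["dates"].notna()]
```

Here dateutil accepts the loosely formatted "2015-08-2" that np.datetime64 rejected in the question, so only the garbage row ends up as NaT.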

Answer 2

Here's another way to do this, using the convert_objects() method:

# make good and bad date csv files
# read in good dates file using parse_dates - no problem
df = pd.read_csv('dategood.csv', parse_dates=['dates'], date_parser=np.datetime64)

df.dtypes

dates    datetime64[ns]
data            float64
dtype: object

# try same code on bad dates file - throws exceptions
df = pd.read_csv('datebad.csv', parse_dates=['dates'], date_parser=np.datetime64)

ValueError: Error parsing datetime string "Q%Bte0tvk5" at position 0

# read the file first without converting dates
# then use convert objects to force conversion
df = pd.read_csv('datebad.csv')
df['cdate'] = df.dates.convert_objects(convert_dates='coerce')

# resulting new date column is a datetime64 same as good data file
df.dtypes

dates            object
data            float64
cdate    datetime64[ns]
dtype: object

# the bad date has NaT in the cdate column - can clean it later
df.head()

        dates      data      cdate
0  2015-12-01  0.914836 2015-12-01
1  2015-12-02  0.866848 2015-12-02
2  2015-12-03  0.103718 2015-12-03
3  2015-12-04  0.514086 2015-12-04
4  Q%Bte0tvk5  0.583617        NaT
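
convert_objects() is not available in recent pandas releases. A minimal sketch of what appears to be the closest current equivalent, pd.to_datetime with errors="coerce", is shown below. The inline CSV text stands in for datebad.csv and simply copies the five rows from the table above.

```python
from io import StringIO

import pandas as pd

# Stand-in for datebad.csv (same rows as the table above)
csv_text = (
    "dates,data\n"
    "2015-12-01,0.914836\n"
    "2015-12-02,0.866848\n"
    "2015-12-03,0.103718\n"
    "2015-12-04,0.514086\n"
    "Q%Bte0tvk5,0.583617\n"
)

# Read the file without converting dates
df = pd.read_csv(StringIO(csv_text))

# Force the conversion; values that cannot be parsed become NaT
df["cdate"] = pd.to_datetime(df["dates"], errors="coerce")
```

As in the original approach, the dates column keeps the raw strings and cdate holds the parsed values, so the bad row can still be inspected before it is cleaned up.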

Answer 3

Use the built-in pd.to_datetime, which converts values it cannot parse to NaT when called with errors='coerce' (by default it raises an error instead):

df = pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=lambda col: pd.to_datetime(col, errors='coerce'),
)

Now you can filter out the invalid rows with a standard NaN/null check:

df = df[~df["dates"].isnull()]
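
If the skipped rows should be kept for logging rather than silently discarded, the same idea can be wrapped in a small helper. This is only a sketch: the function name read_csv_skip_bad_dates and the sample bytes are hypothetical, and it parses the dates after reading instead of relying on date_parser. Note that rows with an empty date cell are also treated as bad.

```python
from io import BytesIO

import pandas as pd


def read_csv_skip_bad_dates(raw_bytes, date_col):
    """Return (good_rows, bad_rows), split on whether date_col parsed."""
    # Read the date column as text so nothing fails during parsing
    df = pd.read_csv(BytesIO(raw_bytes), dtype={date_col: str})
    parsed = pd.to_datetime(df[date_col], errors="coerce")
    ok = parsed.notna()
    bad_rows = df[~ok]  # original text preserved for inspection
    good_rows = df[ok].assign(**{date_col: parsed[ok]})
    return good_rows, bad_rows


# Hypothetical payload with one bad date in the middle
raw_data = b"dates,data\n2015-12-01,0.91\nQ%Bte0tvk5,0.58\n2015-12-03,0.10\n"
good, bad = read_csv_skip_bad_dates(raw_data, "dates")
```

good then holds the two valid rows with a datetime column, and bad holds the rejected row with its original string intact.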

3 answers

Answer 1
4 accepted 2015-12-24 22:50:45

Answer 2
1 2015-12-25 01:45:36

Answer 3
0 2018-08-29 19:46:26
