Skip rows with bad dates while using pd.read_csv
I'm reading in csv files from an external data source using pd.read_csv, as in the code below:
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=np.datetime64,
)
However, somewhere in the csv that's being sent, there is a misformatted date, resulting in the following error:
ValueError: Error parsing datetime string "2015-08-2" at position 8
This causes the entire application to crash. Of course, I can handle this case with a try/except, but then I will lose all the other data in that particular csv. I need pandas to keep and parse that other data.
I have no way of predicting when or where this data (which changes daily) will have badly formatted dates. Is there some way to get pd.read_csv to skip only the rows with bad dates but still parse all the other rows in the csv?
somewhere in the csv that's being sent, there is a misformatted date
np.datetime64 needs ISO 8601 formatted strings to work properly. The good news is that you can wrap np.datetime64 in your own function and use this as the date_parser:
def parse_date(v):
    try:
        return np.datetime64(v)
    except Exception:
        # apply whatever remedies you deem appropriate
        pass
    # fall back to the unparsed value when parsing fails
    return v
pd.read_csv(
    ...
    date_parser=parse_date
)
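One possible remedy for the except branch is to return NaT, so that bad values can be filtered out afterwards. The snippet below is only a sketch of that idea on a plain list of strings. The sample values and the choice of NaT as the fallback are assumptions for illustration, not part of the original answer.

```python
import numpy as np


def parse_date(v):
    """Parse one ISO 8601 date string; return NaT instead of raising."""
    try:
        return np.datetime64(v)
    except ValueError:
        # Assumed remedy: mark the value as missing rather than crash.
        return np.datetime64("NaT")


# Hypothetical sample values: two valid dates and one garbage string.
raw = ["2015-12-01", "Q%Bte0tvk5", "2015-12-03"]

# Parse element by element, then build a day-resolution datetime array.
parsed = np.array([parse_date(v) for v in raw], dtype="datetime64[D]")

# Keep only the entries that parsed successfully.
good = parsed[~np.isnat(parsed)]
```

The same mask (~np.isnat(parsed)) could be used to drop the matching rows from any other columns read alongside the dates.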
I need pandas to keep and parse that other data.
I often find that a more flexible date parser like dateutil works better than np.datetime64 and may even work without the extra function:
import dateutil.parser  # a bare "import dateutil" does not reliably load the parser submodule
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=dateutil.parser.parse,
)
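To illustrate why dateutil can help here, the sketch below feeds it the short-day string from the error message in the question, plus an obviously invalid value. The wrapper name lenient_parse and the choice to return None on failure are assumptions made for this example.

```python
from datetime import datetime

import dateutil.parser


def lenient_parse(text):
    """Return a datetime, or None when dateutil cannot parse the text."""
    try:
        return dateutil.parser.parse(text)
    except (ValueError, OverflowError):
        # dateutil signals unparseable input with ValueError (or a subclass).
        return None


# "2015-08-2" is the value np.datetime64 rejected in the question.
short_day = lenient_parse("2015-08-2")

# A string with no date content at all still fails, so it maps to None.
garbage = lenient_parse("Q%Bte0tvk5")
```

So dateutil tolerates the missing leading zero, but truly malformed values still need a fallback of some kind.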
Here's another way to do this, using the Series.convert_objects() method:
# make good and bad date csv files
# read in good dates file using parse_dates - no problem
df = pd.read_csv('dategood.csv', parse_dates=['dates'], date_parser=np.datetime64)
df.dtypes
dates    datetime64[ns]
data            float64
dtype: object
# try same code on bad dates file - throws exceptions
df = pd.read_csv('datebad.csv', parse_dates=['dates'], date_parser=np.datetime64)
ValueError: Error parsing datetime string "Q%Bte0tvk5" at position 0
# read the file first without converting dates
# then use convert objects to force conversion
df = pd.read_csv('datebad.csv')
df['cdate'] = df.dates.convert_objects(convert_dates='coerce')
# resulting new date column is a datetime64 same as good data file
df.dtypes
dates            object
data            float64
cdate    datetime64[ns]
dtype: object
# the bad date has NaT in the cdate column - can clean it later
df.head()
        dates      data      cdate
0  2015-12-01  0.914836 2015-12-01
1  2015-12-02  0.866848 2015-12-02
2  2015-12-03  0.103718 2015-12-03
3  2015-12-04  0.514086 2015-12-04
4  Q%Bte0tvk5  0.583617        NaT
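Note that convert_objects was deprecated and, as far as I know, is no longer available from pandas 1.0 onward. On current versions the same read-then-coerce approach can be sketched with pd.to_datetime and errors='coerce'. The inline CSV text below is made-up sample data standing in for datebad.csv.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for datebad.csv: the last row has a garbage date.
csv_text = (
    "dates,data\n"
    "2015-12-01,0.914836\n"
    "2015-12-02,0.866848\n"
    "Q%Bte0tvk5,0.583617\n"
)

# Read the file first without converting dates.
df = pd.read_csv(StringIO(csv_text))

# Force the conversion; values that cannot be parsed become NaT.
df["cdate"] = pd.to_datetime(df["dates"], errors="coerce")
```

As in the session above, the original strings stay in the dates column and the bad row shows NaT in cdate.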
Use the built-in pd.to_datetime with errors='coerce', which converts values that cannot be parsed as dates to NaT:
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=lambda col: pd.to_datetime(col, errors='coerce'),
)
Now you can filter out the invalid rows with a standard NaN/null check:
df = df[~df["dates"].isnull()]
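Putting the pieces together, here is a minimal sketch of a helper that reads the raw bytes, coerces the date column, and drops the rows that failed. It does the conversion after read_csv rather than through date_parser, since date_parser is deprecated as of pandas 2.0. The function name read_csv_skip_bad_dates, its return shape, and the sample bytes are all assumptions for illustration.

```python
from io import BytesIO

import pandas as pd


def read_csv_skip_bad_dates(raw_data, date_col="dates"):
    """Read CSV bytes and drop rows whose date column cannot be parsed.

    Returns the cleaned frame and the number of rows that were dropped.
    """
    df = pd.read_csv(BytesIO(raw_data))
    # Unparseable (or empty) date cells become NaT instead of raising.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    bad = df[date_col].isna()
    return df[~bad].reset_index(drop=True), int(bad.sum())


# Hypothetical payload with one bad date in the middle.
raw_data = b"dates,data\n2015-12-01,1.0\nnot-a-date,2.0\n2015-12-03,3.0\n"
clean, n_dropped = read_csv_skip_bad_dates(raw_data)
```

Returning the dropped-row count makes it easy to log how much of each daily file was discarded. Rows with an empty date cell are dropped too, which may or may not be what you want.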