Formatting really inconsistent dates with Python
I have some really messed up dates that I'm trying to get into a consistent format, %Y-%m-%d, where that is possible. Some of the dates lack the day, and some are in the future or just plain impossible; those I'll simply flag as incorrect. How might I tackle such inconsistencies with Python?
sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00
You can use the dateutil parser if you want:
from dateutil.parser import parse

bad_dates = [...]

for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:
        # dateutil raises when it cannot make sense of the string
        print("couldn't parse", d, err)
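Output (dateutil fills any field missing from the input using the current date, which is where values such as the year 2015 and day 30 come from):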
1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month
If you would like to flag any that aren't an easy parse, you can check whether they split into three parts. If they do, try to parse them; otherwise flag them, like so:
flagged, good = [], []
splitters = ['-', ',', '/']

for d in bad_dates:
    try:
        a = None
        for s in splitters:
            # only attempt strings that have day, month and year parts
            if len(d.split(s)) == 3:
                a = parse(d)
                good.append(a)
                break  # avoid appending the same date more than once
        if not a:
            raise Exception
    except Exception:
        flagged.append(d)
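The three-part check above tries each delimiter separately. A variation, sketched here with only the standard library, splits on all of the delimiters at once and sorts the strings by how many fields they contain. The names `count_fields` and `split_by_completeness` are made up for illustration and are not part of dateutil.

```python
import re

# Delimiters seen in the sample data: hyphen, comma, slash and whitespace.
_DELIMITERS = re.compile(r"[-,/\s]+")


def count_fields(date_string):
    """Return how many non-empty fields a date string has."""
    parts = _DELIMITERS.split(date_string.strip())
    return len([part for part in parts if part])


def split_by_completeness(date_strings):
    """Separate strings with three fields (day, month, year) from the rest."""
    complete, incomplete = [], []
    for s in date_strings:
        if count_fields(s) == 3:
            complete.append(s)
        else:
            incomplete.append(s)
    return complete, incomplete


if __name__ == "__main__":
    sample = ["4-Jul-97", "8/31/02", "Jun-13", "7-Dec", "nan", "Sep, 2020"]
    complete, incomplete = split_by_completeness(sample)
    print("three fields:", complete)
    print("needs a closer look:", incomplete)
```

Having three fields does not make a date valid ("90/1/90" has three), so the strings in the first list still need to go through a parser.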
Some of the values are ambiguous. You can get different results depending on your priorities. For example, if you want all dates to be treated consistently, you could specify a list of formats to try:
#!/usr/bin/env python
import re
import sys
from datetime import datetime

for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line))  # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else:  # no break
        sys.stderr.write("failed to parse " + line)
Example:
$ python . <input.txt
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01
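The format list above silently fills in whatever the input lacks: a missing day becomes 1 and a missing year becomes 1900. Since the question asks for incomplete and future dates to be flagged, here is a minimal sketch that pairs each format with a status label. The function name `classify` and the status strings are assumptions made for illustration. Note that `strptime` maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068, so an old date such as "5/12/45" comes out as 2045 and gets flagged as being in the future.

```python
import re
from datetime import date, datetime

# Each format is paired with a label saying which fields it supplies.
FORMATS = [
    ("%d %b %y", "ok"),
    ("%m %d %y", "ok"),
    ("%b %y", "no day"),
    ("%d %b", "no year"),
    ("%b %Y", "no day"),
]


def classify(raw, today):
    """Return (iso_date_or_None, status) for one raw date string."""
    normalized = " ".join(re.findall(r"\w+", raw))  # normalize delimiters
    for date_format, status in FORMATS:
        try:
            parsed = datetime.strptime(normalized, date_format).date()
        except ValueError:
            continue  # try the next format
        if status == "no year":
            # strptime would default the year to 1900, so report no date
            return None, status
        if parsed > today:
            return parsed.isoformat(), "future"
        return parsed.isoformat(), status
    return None, "unparseable"


if __name__ == "__main__":
    for raw in ["4-Jul-97", "Jun-13", "7-Dec", "90/1/90", "nan", "Sep, 2020"]:
        print(raw, "->", classify(raw, today=date.today()))
```

Passing `today` explicitly keeps the function deterministic, which makes it easier to test than calling `date.today()` inside it.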
You could use other criteria instead. For example, you could maximize the number of dates that are parsed successfully, even if that means some dates are treated inconsistently (the dateutil and pandas solutions might fall into this category).
pd.datetools.to_datetime (available as pd.to_datetime in current pandas releases) will have a go at guessing for you. It seems to cope with most of your dates, although you might want to add some rules of your own:
df['sample'].map(lambda x : pd.datetools.to_datetime(x))
Out[52]:
0 1997-07-04 00:00:00
1 2002-08-31 00:00:00
2 1995-05-20 00:00:00
3 1992-05-12 00:00:00
4 2015-06-13 00:00:00
5 1998-08-04 00:00:00
6 90/1/90
7 1977-03-10 00:00:00
8 2015-12-07 00:00:00
9 NaN
10 1998-04-03 00:00:00
11 1976-08-01 00:00:00
12 1990-03-01 00:00:00
13 2015-09-01 00:00:00
14 1974-04-01 00:00:00
15 2003-10-10 00:00:00
16 Dec-00
Name: sample, dtype: object