Formatting really inconsistent dates with Python

I have some really messed-up dates that I'm trying to get into a consistent format, %Y-%m-%d where it applies. Some of the dates lack the day, and some are in the future or just plain impossible; those I'll simply flag as incorrect. How might I tackle such inconsistencies with Python?

sample dates:
4-Jul-97
8/31/02
20-May-95
5/12/92
Jun-13
8/4/98
90/1/90
3/10/77
7-Dec
nan
4/3/98
Aug-76
Mar-90
Sep, 2020
Apr-74
10/10/03
Dec-00

You can use the dateutil parser if you want:

from dateutil.parser import parse

bad_dates = [...]  # the sample dates listed above
for d in bad_dates:
    try:
        print(parse(d))
    except Exception as err:
        # dateutil raises when the string is not a plausible date
        print("couldn't parse", d, err)

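Output (dateutil fills any field missing from the input with the corresponding field of the current date, which is why some results below show the year 2015 or day 30):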

1997-07-04 00:00:00
2002-08-31 00:00:00
1995-05-20 00:00:00
1992-05-12 00:00:00
2015-06-13 00:00:00
1998-08-04 00:00:00
couldn't parse 90/1/90 day is out of range for month
1977-03-10 00:00:00
2015-12-07 00:00:00
couldn't parse nan unknown string format
1998-04-03 00:00:00
1976-08-30 00:00:00
1990-03-30 00:00:00
2020-09-30 00:00:00
1974-04-30 00:00:00
2003-10-10 00:00:00
couldn't parse Dec-00 day is out of range for month

If you would like to flag any date that isn't an easy parse, you can check whether it splits into three parts: if it does, try to parse it; otherwise flag it, like so:

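flagged, good = [], []
splitters = ['-', ',', '/']
for d in bad_dates:
    try:
        # only attempt dates that split into exactly three parts
        if not any(len(d.split(s)) == 3 for s in splitters):
            raise ValueError('not a three-part date')
        good.append(parse(d))
    except Exception:
        # non-strings (e.g. a float nan), short dates and impossible dates end up here
        flagged.append(d)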
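
The three-part check can also be done without dateutil. The following is only a sketch using the standard library: count_parts and split_by_completeness are hypothetical helper names, and the set of separators (hyphen, slash, comma, whitespace) is an assumption based on the sample dates above.

```python
import re

def count_parts(value):
    """Count the non-empty fields separated by '-', '/', ',' or whitespace."""
    fields = re.split(r"[-/,\s]+", str(value).strip())
    return len([f for f in fields if f])

def split_by_completeness(values):
    """Separate values with exactly three fields from everything else."""
    complete, partial = [], []
    for v in values:
        (complete if count_parts(v) == 3 else partial).append(v)
    return complete, partial

complete, partial = split_by_completeness(
    ["4-Jul-97", "Jun-13", "Sep, 2020", "nan", "90/1/90"]
)
print(complete)  # three-field strings, still unvalidated
print(partial)   # strings with a missing field, or not dates at all
```

Note that a string such as 90/1/90 has three fields but is still impossible, so the count only tells you which strings are worth handing to a real parser.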

Some of the values are ambiguous, so you can get different results depending on your priorities. For example, if you want all dates to be treated consistently, you could specify a list of formats to try:

#!/usr/bin/env python
import re
import sys
from datetime import datetime

for line in sys.stdin:
    date_string = " ".join(re.findall(r'\w+', line)) # normalize delimiters
    for date_format in ["%d %b %y", "%m %d %y", "%b %y", "%d %b", "%b %Y"]:
        try:
            print(datetime.strptime(date_string, date_format).date())
            break
        except ValueError:
            pass
    else: # no break
        sys.stderr.write("failed to parse " + line)

Example:

$ python . <input.txt 
1997-07-04
2002-08-31
1995-05-20
1992-05-12
2013-06-01
1998-08-04
failed to parse 90/1/90
1977-03-10
1900-12-07
failed to parse nan
1998-04-03
1976-08-01
1990-03-01
2020-09-01
1974-04-01
2003-10-10
2000-12-01
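
The script prints a date for every string it can parse, but the question also asks to flag future dates and to use %Y-%m-%d only where it applies. A possible extension is sketched below, under several assumptions that are not in the original answer: the helper name normalize, the flag strings, the choice to emit %Y-%m when the day is missing, and the choice to flag dates without a year rather than default them to 1900.

```python
import re
from datetime import date, datetime

# (format, has_day): same year-bearing formats as the script above, tried in order.
FORMATS = [
    ("%d %b %y", True),
    ("%m %d %y", True),
    ("%b %y", False),
    ("%b %Y", False),
]

def normalize(raw, today=None):
    """Return (text, flag). text is 'YYYY-MM-DD', 'YYYY-MM' or None."""
    today = today or date.today()
    cleaned = " ".join(re.findall(r"\w+", str(raw)))  # normalize delimiters
    for fmt, has_day in FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
        if parsed > today:
            return None, "in the future"
        return parsed.strftime("%Y-%m-%d" if has_day else "%Y-%m"), None
    # Day and month without a year ("7-Dec"): validate against a leap year, then flag.
    try:
        datetime.strptime(cleaned + " 2000", "%d %b %Y")
    except ValueError:
        return None, "unparseable"
    return None, "missing year"

for sample in ["4-Jul-97", "Aug-76", "7-Dec", "90/1/90", "nan"]:
    print(sample, normalize(sample))
```

Because %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068, a two-digit year such as 45 comes out as 2045 and would be flagged as being in the future here; whether that is the right call depends on the data.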

You could use other criteria. For example, you could maximize the number of dates that are parsed successfully, even if some dates are then treated inconsistently (the dateutil and pandas solutions might fall into this category).

pd.datetools.to_datetime (available as pd.to_datetime in newer pandas releases) will have a go at guessing for you. It seems to go OK with most of your dates, although you might want to put in some additional rules.

df['sample'].map(lambda x : pd.datetools.to_datetime(x))
Out[52]: 
0     1997-07-04 00:00:00
1     2002-08-31 00:00:00
2     1995-05-20 00:00:00
3     1992-05-12 00:00:00
4     2015-06-13 00:00:00
5     1998-08-04 00:00:00
6                 90/1/90
7     1977-03-10 00:00:00
8     2015-12-07 00:00:00
9                     NaN
10    1998-04-03 00:00:00
11    1976-08-01 00:00:00
12    1990-03-01 00:00:00
13    2015-09-01 00:00:00
14    1974-04-01 00:00:00
15    2003-10-10 00:00:00
16                 Dec-00
Name: sample, dtype: object
