How would I normalize dates in a CSV file? (Python)

I have a CSV file with a field named start_date that contains data in a variety of formats.

Some of the formats include, e.g., June 23, 1912 or 5/11/1930 (month, day, year). But not all values are valid dates.

I want to add a start_date_description field adjacent to the start_date column and move invalid date values into it. Lastly, I want to normalize all valid date values in start_date to ISO 8601 (i.e., YYYY-MM-DD).

So far I have only been able to load the start_date column from my file. I am stuck and would appreciate any help. Any solution, especially one that does not use a library, would be great!

import csv

date_column = "start_date"

# Collect the values of the start_date column
with open("test.csv", "r") as f:
    csv_reader = csv.reader(f)
    headers = None
    results = []
    for row in csv_reader:
        if headers is None:
            # First row: remember the index of each date column
            headers = [i for i, col in enumerate(row) if col == date_column]
        else:
            results.append([row[i] for i in headers])

print(results)

One way is to use the dateutil module. You can parse date strings as follows:

from dateutil import parser
parser.parse('3/16/78')
parser.parse('4-Apr') # this will give current year i.e. 2017

The parsed value can then be converted to your target format with strftime:

dt = parser.parse('3/16/78')
dt.strftime('%Y-%m-%d')
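For the formatting step itself, the standard library also offers date.isoformat(), which yields the same YYYY-MM-DD string as strftime('%Y-%m-%d'). A minimal sketch, using a hand-built datetime as a stand-in for the parsed value so that it runs without dateutil:

```python
from datetime import datetime

# Stand-in for a parsed value; built by hand here (assumption for illustration)
parsed = datetime(1978, 3, 16)

iso_via_strftime = parsed.strftime('%Y-%m-%d')
iso_via_isoformat = parsed.date().isoformat()

print(iso_via_strftime)   # 1978-03-16
print(iso_via_isoformat)  # 1978-03-16
```

Either form works; isoformat() avoids spelling out the directive string.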

If the table is loaded into a pandas DataFrame, you can define a parsing function and apply it to the column as follows:

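def parse_date(value):
    # Return the ISO 8601 date, or an empty string if the value cannot be parsed
    try:
        return parser.parse(value).strftime('%Y-%m-%d')
    except (ValueError, OverflowError, TypeError):
        return ''

df['parse_date'] = df.start_date.map(parse_date)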

Question: ... add a start_date_description ... normalize ... to ISO 8601

This reads the file test.csv, validates the date string in the start_date column against a list of date directive patterns, and returns a dict{description, ISO}. The returned dict is used to update the current row dict, and the updated row dict is written to the file test_update.csv.

Put this in a new Python file and run it!

A valid date directive pattern that is missing can simply be added to the list.
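For instance, the two formats mentioned in the question, June 23, 1912 and 5/11/1930, use a four-digit year. A minimal sketch of how candidate patterns could be tried out before adding them; the helper name first_match and the EXTRA_PATTERNS list are illustrative assumptions, not part of the answer's code:

```python
from datetime import datetime

# Hypothetical extra patterns (assumption): full month name and 4-digit years
EXTRA_PATTERNS = [
    ('%B %d, %Y', 'Valid'),   # June 23, 1912
    ('%m/%d/%Y', 'Valid'),    # 5/11/1930
]

def first_match(text, patterns):
    """Return (description, ISO date) for the first pattern that parses text, else None."""
    for pattern, description in patterns:
        try:
            parsed = datetime.strptime(text, pattern)
        except ValueError:
            # strptime raises ValueError when the text does not fit the pattern
            continue
        return description, parsed.strftime('%Y-%m-%d')
    return None

print(first_match('June 23, 1912', EXTRA_PATTERNS))  # ('Valid', '1912-06-23')
print(first_match('5/11/1930', EXTRA_PATTERNS))      # ('Valid', '1930-05-11')
print(first_match('not a date', EXTRA_PATTERNS))     # None
```

Note that %B matches month names in the current locale, so this sketch assumes an English locale.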

Python 3.6 documentation: 8.1.8. strftime() and strptime() Behavior

import csv
import re
from datetime import datetime as dt

def validate(date):
    def _dict(desc, date):
        return {'start_date_description':desc, 'ISO':date}

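    # Each entry is (strptime pattern, description); the first pattern that matches wins.
    # '%B %d. %Y' is enabled so that 'June 23. 1912' is reported as valid, as in the output.
    for pattern, desc in [('%m/%d/%y', 'Valid'), ('%b-%y', 'Short, missing Day'),
                          ('%d-%b-%y', 'Valid'), ('%d-%b', 'Short, missing Year'),
                          ('%B %d. %Y', 'Valid')]:
        try:
            _dt = dt.strptime(date, pattern)
            return _dict(desc, _dt.strftime('%Y-%m-%d'))
        except ValueError:
            continue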

    if not re.search(r'\d+', date):
        return _dict('No Digit', None)

    return _dict('Unknown Pattern', None)

with open('test.csv', newline='') as fh_in, open('test_update.csv', 'w', newline='') as fh_out:
    csv_reader = csv.DictReader(fh_in)
    csv_writer = csv.DictWriter(fh_out,
                                fieldnames=csv_reader.fieldnames +
                                           ['start_date_description', 'ISO'] )
    csv_writer.writeheader()

    for row, values in enumerate(csv_reader,2):
        values.update(validate(values['start_date']))

        # Show only Invalid Dates
        if any(w in values['start_date_description'] 
               for w in ['Unknown', 'No Digit', 'missing']):

            print('{:>3}: {v[start_date]:13.13} {v[start_date_description]:<22} {v[ISO]}'.
                  format(row, v=values))

        csv_writer.writerow(values)

Output:

start_date     start_date_description  ISO
June 23. 1912  Valid                   1912-06-23
12/31/91       Valid                   1991-12-31
Oct-84         Short, missing Day      1984-10-01
Feb-09         Short, missing Day      2009-02-01
10-Dec-80      Valid                   1980-12-10
10/7/81        Valid                   1981-10-07
Facere volupt  No Digit                None
... (omitted for brevity)

Tested with Python 3.4.2
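One caveat with the two-digit %y directive used in these patterns: Python follows the POSIX convention in which values 69 to 99 map to 1969 to 1999 and values 0 to 68 map to 2000 to 2068. Older dates written with a two-digit year would therefore land in the wrong century. A small sketch of the behaviour, plus one possible correction; the helper fix_century and its latest_year cutoff are assumptions for illustration and should be adapted to the data:

```python
from datetime import datetime

def parse_short_year(text):
    """Parse a month/day/two-digit-year string with the POSIX %y pivot."""
    return datetime.strptime(text, '%m/%d/%y')

print(parse_short_year('10/7/81').year)  # 1981
print(parse_short_year('5/11/30').year)  # 2030

def fix_century(parsed, latest_year):
    """Hypothetical helper: shift a date back 100 years if its year is past latest_year.

    Note: replace() raises ValueError for Feb 29 if the target year is not a leap year.
    """
    if parsed.year > latest_year:
        return parsed.replace(year=parsed.year - 100)
    return parsed

fixed = fix_century(parse_short_year('5/11/30'), latest_year=2017)
print(fixed.date().isoformat())  # 1930-05-11
```

Whether such a shift is appropriate depends on what the column holds; for dates that can legitimately lie in the future it would be wrong.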
