简体   繁体   中英

Using Pandas to reformat dates with inconsistent User inputs

I am trying to clean a spreadsheet of user-inputted data that includes a "birth_date" column. The issue I am having is that the date formating ranges widely between users, including inputs without markers between the date, month, and year. I am having a hard time developing a formula that is intelligent enough to interpret such a wide range of inputs. Here is a sample:

1/6/46
7/28/99
11272000
11/28/78

Here is where I started:

df['birth_date']=pd.to_datetime(df.birth_date)

This does not seem to make it past the first example, as it looks for a two-month format. Can anyone help with this?

Your best bet is to check each input and give a consistent output. Assuming Month-Day-Year formats, you can use this function

import pandas as pd
import re

def fix_dates(dates):
    new = []
    for date in dates:
        chunks = re.split(r"[\/\.\-]", date)
        if len(chunks) == 3:
            m, d, y = map(lambda x: x.zfill(2), chunks)
            y = y[2:] if len(y) == 4 else y
            new.append(f"{m}/{d}/{y}")
        else:
            m = date[:2]
            d = date[2:4]
            y = date[4:]
            y = y[2:] if len(y) == 4 else y
            new.append(f"{m}/{d}/{y}")
    return new

inconsistent_dates = '1/6/46 7/28/99 11272000 11/28/78'.split(' ')

pd.to_datetime(pd.Series(fix_dates(inconsistent_dates)))

0   2046-01-06
1   1999-07-28
2   2000-11-27
3   1978-11-28
dtype: datetime64[ns]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM