Split a date-of-birth pandas DataFrame column into 3 different columns in Python 3

df_dob=pd.DataFrame(
[
{'date':'DOB 19 Jun 1951'},
{'date':'DOB Jun 1951'},
{'date':'DOB 1951'}
]
)

The DataFrame above contains dates of birth in 3 different formats.

df_dob['date'].apply(transform_date) 

I am trying to write a function, applied as shown above, that turns the DataFrame into 3 columns:

The 1st column holds the full date, e.g. 1951-06-19 00:00:00

The 2nd column holds the year and month, e.g. 1951-06

The 3rd column holds the year only, e.g. 1951

Desired output:

1951-06-19 00:00:00,NaN,NaN
NaN,1951-06,NaN
NaN,NaN,1951

The following is my code, and it has 2 problems:

(1) The regex cannot handle "DOB Jun 1951", so the call fails with "TypeError: object of type 'NoneType' has no len()",

as mentioned here: Python: TypeError: object of type 'NoneType' has no len()

(2) If we remove "DOB Jun 1951" from the input, we get the following error:

57 df_dob['date'].apply(transform_date)

"TypeError: invalid type promotion"

Is there a better solution? Thanks!

import re
from datetime import datetime

import numpy as np
import pandas as pd

def transform_date(x):
    if len(x.split(';')) > 0:
        # Optional "day month " prefix followed by a 4-digit year
        regex = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
        # Number of matches in the first ';'-separated field
        l = len(re.findall(regex, x.split(';')[0]))
        if l > 0:
            # Keep the last match
            new = re.findall(regex, x.split(';')[0])[l - 1]
            print(new)
            if len(new) == 11:
                # Full date, e.g. "19 Jun 1951"
                print(datetime.strptime(new, '%d %b %Y'))
                return pd.Series([datetime.strptime(new, '%d %b %Y'), np.nan, np.nan])
            elif len(new) == 4:
                # Year only, e.g. "1951"
                print(datetime.strptime(new, '%Y').year)
                return pd.Series([np.nan, np.nan, datetime.strptime(new, '%Y').year])
            else:
                # Month and year, e.g. "Jun 1951"
                mmyyyy = str(datetime.strptime(new, '%b %Y').year) + '-' + str(datetime.strptime(new, '%b %Y').month)
                print(mmyyyy)
                return pd.Series([np.nan, mmyyyy, np.nan])
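
Problem (1) can be reproduced with a quick check, sketched here. The pattern treats the day and month as optional only as a pair, so "DOB Jun 1951" yields no match, and `transform_date` then falls through and returns `None`. The `regex_fixed` variant is a hypothetical adjustment for illustration, not part of the original code; it makes the day optional on its own.

```python
import re

# Pattern from the question: the day and month are optional only as a pair.
regex = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"

# Hypothetical variant: the day is optional by itself, so "Jun 1951" can match.
regex_fixed = r"\bDOB ((?:(?:(?:3[01]|[12][0-9]|0?[1-9]) )?[A-Za-z]+ )?\d{4})\b"

samples = ["DOB 19 Jun 1951", "DOB Jun 1951", "DOB 1951"]

# findall returns the captured group for each match, or [] when nothing matches.
original_matches = [re.findall(regex, s) for s in samples]
fixed_matches = [re.findall(regex_fixed, s) for s in samples]

for s, before, after in zip(samples, original_matches, fixed_matches):
    print(s, before, after)
```

With the original pattern the middle sample produces an empty list, which is the case the function does not handle.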

I think you can extract the dates and skip the DOB prefix:

pattern = r"(?P<date1>\d{2}\s[A-Za-z]{3}\s\d{4})|(?P<date2>[A-Za-z]{3}\s\d{4})|(?P<date3>\d{4})"

# Drop the leading "DOB", then capture each format into its own named column
dates = df_dob["date"].str[3:].str.extract(pattern)
dates[["date1","date2"]] = dates[["date1","date2"]].apply(pd.to_datetime)
print(dates)

       date1      date2 date3
0 1951-06-19        NaT   NaN
1        NaT 1951-06-01   NaN
2        NaT        NaT  1951
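
To get closer to the desired output, where the second column reads 1951-06 and the third is a plain year, the same extraction could be followed by per-column formatting. This is a minimal sketch under a few assumptions: the explicit `format` strings, the "YYYY-MM" string rendering, and the nullable `Int64` year column are choices made here, not part of the answer above, and month abbreviations are assumed to be English.

```python
import pandas as pd

df_dob = pd.DataFrame(
    [
        {"date": "DOB 19 Jun 1951"},
        {"date": "DOB Jun 1951"},
        {"date": "DOB 1951"},
    ]
)

pattern = (
    r"(?P<date1>\d{2}\s[A-Za-z]{3}\s\d{4})"
    r"|(?P<date2>[A-Za-z]{3}\s\d{4})"
    r"|(?P<date3>\d{4})"
)

# Drop the leading "DOB" so it cannot be mistaken for a month abbreviation.
dates = df_dob["date"].str[3:].str.extract(pattern)

# Full dates become datetime values; rows without a match become NaT.
dates["date1"] = pd.to_datetime(dates["date1"], format="%d %b %Y")

# Month-year values are parsed, then rendered as "YYYY-MM" strings.
dates["date2"] = pd.to_datetime(dates["date2"], format="%b %Y").dt.strftime("%Y-%m")

# Years become nullable integers so missing rows stay missing.
dates["date3"] = pd.to_numeric(dates["date3"]).astype("Int64")

print(dates)
```

Each row then has a value in exactly one of the three columns, with the other two left missing.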
