Split a date of birth pandas dataframe into 3 different columns in Python 3
import pandas as pd

df_dob = pd.DataFrame(
    [
        {'date': 'DOB 19 Jun 1951'},
        {'date': 'DOB Jun 1951'},
        {'date': 'DOB 1951'}
    ]
)
The dataframe above holds 3 different formats of date of birth. It is processed with:
df_dob['date'].apply(transform_date)
I am trying to write a function that can be applied as above, so that the dataframe is turned into 3 columns:
The first column can hold 1951-06-19 00:00:00
The 2nd column can hold 1951-06
The 3rd column can hold 1951
Desired output:
1951-06-19 00:00:00, NaN, NaN
NaN, 1951-06, NaN
NaN, NaN, 1951
The following is my code, and there are 2 problems:
(1) The regex cannot handle "DOB Jun 1951", so the call fails with "TypeError: object of type 'NoneType' has no len()", as mentioned here: Python: TypeError: object of type 'NoneType' has no len()
(2) If we remove "DOB Jun 1951" from the input, we get the following error instead:
df_dob['date'].apply(transform_date)
"TypeError: invalid type promotion"
I wonder if there might be a better solution? Thanks!
import re
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

def transform_date(x):
    if len(x.split(';')) > 0:
        regex = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
        #'DOB (.*)'
        l = len(re.findall(regex, x.split(';')[0]))
        if l > 0:
            # new = re.findall('DOB (.*)', x.split(';')[0])[0]
            # while l <= len(re.findall('DOB (.*)', x.split(';')[0])):
            new = re.findall(regex, x.split(';')[0])[l - 1]
            print(new)
            # print(type(new))
            # l = l+1
            if len(new) == 11:
                print(datetime.strptime(new, '%d %b %Y'))
                return pd.Series([datetime.strptime(new, '%d %b %Y'), np.nan, np.nan])
            elif len(new) == 4:
                print(datetime.strptime(new, '%Y').year)
                return pd.Series([np.nan, np.nan, datetime.strptime(new, '%Y').year])
            else:
                print(str(datetime.strptime(new, '%b %Y').year) + '-' + str(datetime.strptime(new, '%b %Y').month))
                mmyyyy = str(datetime.strptime(new, '%b %Y').year) + '-' + str(datetime.strptime(new, '%b %Y').month)
                return pd.Series([np.nan, mmyyyy, np.nan])
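Problem (1) can be seen in isolation by running the pattern from the question against the three sample strings. The sketch below does that and compares it with a variant in which the day is optional on its own. The names `samples`, `original`, and `relaxed` are illustrative, and the `relaxed` pattern is an assumption about one possible fix, not part of the original code.

```python
import re

samples = ["DOB 19 Jun 1951", "DOB Jun 1951", "DOB 1951"]

# Pattern from the question: a month is only accepted when a day precedes it.
original = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"

# Hypothetical variant: the day is optional inside the optional "day month" group.
relaxed = r"\bDOB ((?:(?:(?:3[01]|[12][0-9]|0?[1-9]) )?[A-Za-z]+ )?\d{4})\b"

original_hits = [re.findall(original, s) for s in samples]
relaxed_hits = [re.findall(relaxed, s) for s in samples]

for s, before, after in zip(samples, original_hits, relaxed_hits):
    print(f"{s!r}: original={before} relaxed={after}")
```

With the original pattern the middle string produces an empty list, so `transform_date` skips the `if l > 0` branch and returns `None` for that row instead of a `pd.Series`.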
I think you can use str.extract to pull out the dates and skip the DOB prefix:
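# One named group per date format, with the most specific alternative first.
pattern = r"(?P<date1>\d{2}\s[A-Za-z]{3}\s\d{4})|(?P<date2>[A-Za-z]{3}\s\d{4})|(?P<date3>\d{4})"
# Drop the leading "DOB" (3 characters), then capture each format into its own column.
dates = df_dob["date"].str[3:].str.extract(pattern)
# Parse the full and month-year strings; date3 stays a plain year string.
dates[["date1","date2"]] = dates[["date1","date2"]].apply(pd.to_datetime)
print(dates)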
       date1      date2 date3
0 1951-06-19        NaT   NaN
1        NaT 1951-06-01   NaN
2        NaT        NaT  1951
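The output above keeps the month-year value as a full timestamp (1951-06-01), while the question asked for 1951-06 in the second column. The following is a minimal sketch of one way to get closer to the desired layout, assuming English month abbreviations and that every value starts with "DOB ". The column names `full`, `year_month`, and `year` are made up for illustration.

```python
import pandas as pd

df_dob = pd.DataFrame(
    {"date": ["DOB 19 Jun 1951", "DOB Jun 1951", "DOB 1951"]}
)

# Anchor on the "DOB" prefix so that "DOB" itself is never mistaken for a month,
# then try the most specific format first.
pattern = (
    r"DOB\s+(?:"
    r"(?P<full>\d{1,2}\s[A-Za-z]{3}\s\d{4})"
    r"|(?P<year_month>[A-Za-z]{3}\s\d{4})"
    r"|(?P<year>\d{4})"
    r")"
)
parts = df_dob["date"].str.extract(pattern)

result = pd.DataFrame(
    {
        # Full dates become timestamps; rows without a match become NaT.
        "full": pd.to_datetime(parts["full"], format="%d %b %Y"),
        # Month-year values are parsed, then rendered back as "YYYY-MM" strings.
        "year_month": pd.to_datetime(
            parts["year_month"], format="%b %Y"
        ).dt.strftime("%Y-%m"),
        # Years become nullable integers so missing rows stay missing.
        "year": pd.to_numeric(parts["year"]).astype("Int64"),
    }
)
print(result)
```

Each row ends up with exactly one non-missing column. Passing an explicit `format` to `pd.to_datetime` avoids relying on format inference, at the cost of failing on strings that do not follow the assumed layout.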