Extract date from text column Pandas
How can I extract a date from a text column? For example:
import pandas as pd
data = [
[1, "NOV 20/00 I have a date"],
[2, "DEC 20 I am going to shopping"],
[3, "I done with all the things"],
[4, "NOV 10 2021 YES I AM"],
[5, "JAN/20/2020 - WILL CALL IN DIRECTIONS"],
]
chk = pd.DataFrame(data, columns = ['id', 'strin'])
Required output:
1 2000-11-20
2
3
4 2021-11-10
5 2020-01-20
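This isn't the prettiest solution, but it works.
I only added month-name-to-number mappings for two of the months in your sample data (JAN and NOV), because I couldn't guess what your other mappings might be. There is also no reason you couldn't add several mappings for the same month, e.g. 'FBR': 2, 'FEB': 2
Sample data setup: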
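As a rough sketch of how the mapping could be extended to all twelve months plus an alias: the three-letter English abbreviations below and the reuse of the 'FBR' alias are assumptions for illustration, since the real spellings in your data are unknown.

```python
# Assumed standard three-letter English abbreviations, in calendar order.
names = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()

# {"JAN": 1, "FEB": 2, ..., "DEC": 12}
monthMap = {name: number for number, name in enumerate(names, start=1)}

# Several spellings may map to the same month number.
# "FBR" is only an illustrative alias; add whatever your data contains.
monthMap["FBR"] = 2

# The alternation used inside the date regex picks up every key.
monthRegex = "|".join(monthMap.keys())
```

Because monthRegex is built from the dictionary keys, adding an alias to monthMap is enough for the regex to accept it as well.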
import pandas as pd
data = [
[1, "NOV 20/00 I have a date"],
[2, "DEC 20 I am going to shopping"],
[3, "I done with all the things"],
[4, "NOV 10 2021 YES I AM"],
[5, "JAN/20/2020 - WILL CALL IN DIRECTIONS"],
[6, "JAN/7/2020 This has single-digit day"]
]
chk = pd.DataFrame(data, columns = ['id', 'text'])
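Note that I named the second column text, because that makes more sense to me than strin. I also added an extra row to test matching of single-digit days.
Code that parses the dates where possible: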
import datetime

# Month abbreviation -> month number; extend as needed.
monthMap = {"JAN": 1, "NOV": 11}
monthRegex = "|".join(monthMap.keys())
separatorRegex = "[ /]"
# Raw f-string so that \d is not interpreted as a string escape.
dateRegex = rf"^({monthRegex}){separatorRegex}(\d\d?){separatorRegex}(\d\d\d?\d?)"

def mapMatchesToDate(matchRow):
    # matchRow is the list of (month, day, year) tuples returned by str.findall.
    if matchRow:
        [month, day, year] = matchRow[0]
        month = monthMap[month]
        # Two-digit years are treated as 20xx.
        year = "20" + year if len(year) == 2 else year
        return datetime.date(int(year), int(month), int(day))
    else:
        return None

chk['text'].str.findall(dateRegex).apply(mapMatchesToDate)
This outputs a pandas Series (i.e. a single column):
0 2000-11-20
1 None
2 None
3 2021-11-10
4 2020-01-20
5 2020-01-07
This can be assigned to a column in the original data, e.g.:
chk['date'] = chk['text'].str.findall(dateRegex).apply(mapMatchesToDate)
This means chk is now:
   id                                   text        date
0   1                NOV 20/00 I have a date  2000-11-20
1   2          DEC 20 I am going to shopping        None
2   3             I done with all the things        None
3   4                   NOV 10 2021 YES I AM  2021-11-10
4   5  JAN/20/2020 - WILL CALL IN DIRECTIONS  2020-01-20
5   6   JAN/7/2020 This has single-digit day  2020-01-07
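One caveat with mapMatchesToDate as written: datetime.date raises ValueError for an impossible date such as NOV 31, which would abort the whole apply call. A minimal sketch of a more defensive variant follows; the name safeMapMatchesToDate is hypothetical, and returning None for bad rows is an assumption about what you want.

```python
import datetime

monthMap = {"JAN": 1, "NOV": 11}

def safeMapMatchesToDate(matchRow):
    """Hypothetical defensive variant of mapMatchesToDate."""
    if not matchRow:
        return None
    month, day, year = matchRow[0]
    # Two-digit years are treated as 20xx, as in the original function.
    year = "20" + year if len(year) == 2 else year
    try:
        return datetime.date(int(year), monthMap[month], int(day))
    except (KeyError, ValueError):
        # KeyError: month not present in monthMap.
        # ValueError: impossible date, e.g. day 31 in November.
        return None
```

It can be used in place of the original function, e.g. chk['text'].str.findall(dateRegex).apply(safeMapMatchesToDate).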
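The ^ character at the start of dateRegex anchors the match to the beginning of the text, so only dates at the very start are picked up. The accepted separators are a space and /; these are defined in separatorRegex.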
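To see the effect of the anchor and the separators in isolation, here is a small sketch using the standard re module instead of pandas. The last two sample strings are made up for illustration and are not from the question.

```python
import re

monthRegex = "JAN|NOV"
separatorRegex = "[ /]"
dateRegex = rf"^({monthRegex}){separatorRegex}(\d\d?){separatorRegex}(\d\d\d?\d?)"

samples = [
    "NOV 20/00 I have a date",        # mixed separators: space, then /
    "JAN/7/2020 This has single-digit day",
    "DEC 20 I am going to shopping",  # DEC is not in monthRegex, and no year
    "Call me on NOV 10 2021",         # date is not at the start of the text
    "NOV-10-2021 dashes",             # - is not in separatorRegex
]

# Each value is a list of (month, day, year) tuples, or [] if nothing matched.
results = {text: re.findall(dateRegex, text) for text in samples}

for text, match in results.items():
    print(f"{text!r}: {match}")
```

Only the first two strings produce a match. To accept dashes as well, separatorRegex could be changed to "[ /-]".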