Extract date from text column Pandas
How can I extract a date from a text column? For example:
import pandas as pd
data = [
[1, "NOV 20/00 I have a date"],
[2, "DEC 20 I am going to shopping"],
[3, "I done with all the things"],
[4, "NOV 10 2021 YES I AM"],
[5, "JAN/20/2020 - WILL CALL IN DIRECTIONS"],
]
chk = pd.DataFrame(data, columns = ['id', 'strin'])
Required output:
1 2000-11-20
2
3
4 2021-11-10
5 2020-01-20
Not the most pleasant solution, but this 'works'.
I've only added month-to-month-number mappings for the two months (JAN and NOV) in your sample data, as I can't guess what your other mappings might be. There is also no reason you can't add multiple mappings for each month, e.g.:
'FBR': 2, 'FEB': 2
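As a rough sketch of what a fuller mapping could look like, the snippet below covers the twelve standard three-letter abbreviations and then adds one alias. The names `standardMonths` and `fullMonthMap` are illustrative, and the `FBR` alias is only the example spelling mentioned above; which aliases you actually need depends on your data.

```python
# Standard three-letter month abbreviations, in calendar order.
standardMonths = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Map each abbreviation to its month number (JAN -> 1, ..., DEC -> 12).
fullMonthMap = {name: number for number, name in enumerate(standardMonths, start=1)}

# Extra spellings can point at the same month number (assumed alias).
fullMonthMap["FBR"] = 2

# Alternation for use inside a regex, built the same way as for a smaller map.
fullMonthRegex = "|".join(fullMonthMap.keys())
```

With this map in place of the two-entry one, a row such as "DEC 20 I am going to shopping" would still not parse, because the pattern used in this answer requires a year after the day.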
Sample data setup:
import pandas as pd
data = [
[1, "NOV 20/00 I have a date"],
[2, "DEC 20 I am going to shopping"],
[3, "I done with all the things"],
[4, "NOV 10 2021 YES I AM"],
[5, "JAN/20/2020 - WILL CALL IN DIRECTIONS"],
[6, "JAN/7/2020 This has single-digit day"]
]
chk = pd.DataFrame(data, columns = ['id', 'text'])
Note that I called the second column text, as it made more sense to me than strin. I also added an extra row to test single-digit day matches.
Code to parse the dates where possible:
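import datetime

# Month abbreviation -> month number; extend with any other spellings you need
monthMap = {"JAN" : 1, "NOV" : 11}
monthRegex = "|".join(monthMap.keys())
separatorRegex = "[ /]"
# Raw f-string so that \d reaches the regex engine unchanged
dateRegex = rf"^({monthRegex}){separatorRegex}(\d\d?){separatorRegex}(\d\d\d?\d?)"

def mapMatchesToDate(matchRow):
    # matchRow is the list returned by findall: empty, or one (month, day, year) tuple
    if matchRow:
        [month, day, year] = matchRow[0]
        month = monthMap[month]
        # Two-digit years are assumed to be in the 2000s
        year = "20" + year if len(year) == 2 else year
        return datetime.date(int(year), int(month), int(day))
    else:
        return None

chk['text'].str.findall(dateRegex).apply(mapMatchesToDate)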
This outputs a pandas Series (i.e., just a single column):
0 2000-11-20
1 None
2 None
3 2021-11-10
4 2020-01-20
5 2020-01-07
This could be assigned to a column in the original data with (e.g.):
chk['date'] = chk['text'].str.findall(dateRegex).apply(mapMatchesToDate)
Meaning that chk is now:
   id                                   text        date
0   1                NOV 20/00 I have a date  2000-11-20
1   2          DEC 20 I am going to shopping        None
2   3             I done with all the things        None
3   4                   NOV 10 2021 YES I AM  2021-11-10
4   5  JAN/20/2020 - WILL CALL IN DIRECTIONS  2020-01-20
5   6   JAN/7/2020 This has single-digit day  2020-01-07
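The date column produced this way holds plain datetime.date objects and None, so pandas stores it with the generic object dtype. If you want a proper datetime column (for the .dt accessor, sorting, or resampling), one possible follow-up is pd.to_datetime. This is a minimal sketch on a small stand-in Series; the names `dates` and `converted` are illustrative and not part of the answer's code.

```python
import datetime
import pandas as pd

# Stand-in for chk['date']: datetime.date objects mixed with None.
dates = pd.Series([datetime.date(2000, 11, 20), None, datetime.date(2020, 1, 7)])

# Convert to a datetime64 column; missing values become NaT.
converted = pd.to_datetime(dates)

print(converted.dtype)      # a datetime64 dtype instead of object
print(converted.dt.year)    # the .dt accessor is now available
```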
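The ^ character at the start of dateRegex anchors the match to the start of the text, so a date is only picked up when the string begins with it. The separators allowed between month, day and year are a space and /. These are defined in separatorRegex.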
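To see the effect of the anchor and the separator class in isolation, here is a small sketch using Python's re module directly, with the same pattern as in the answer. The helper name `findDate` and the last two sample strings are made up for illustration.

```python
import re

monthMap = {"JAN": 1, "NOV": 11}
monthRegex = "|".join(monthMap.keys())
separatorRegex = "[ /]"
dateRegex = rf"^({monthRegex}){separatorRegex}(\d\d?){separatorRegex}(\d\d\d?\d?)"

def findDate(text):
    # Returns a list with one (month, day, year) tuple, or an empty list.
    return re.findall(dateRegex, text)

# Date at the start, separated by a space and a slash: matches.
print(findDate("NOV 20/00 I have a date"))
# Date not at the start of the string: the ^ anchor prevents a match.
print(findDate("Call me on NOV 20/00"))
# Hyphen is not in separatorRegex: no match.
print(findDate("JAN-20-2020 hyphenated"))
```

Removing the ^ would let the pattern match a date anywhere in the text, and adding characters to separatorRegex (for example a hyphen) would widen the set of accepted formats.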