
Extract date from text column Pandas

How can I extract a date from a text column? For example:

import pandas as pd

data = [
    [1, "NOV 20/00 I have a date"],
    [2, "DEC 20 I am going to shopping"],
    [3, "I done with all the things"],
    [4, "NOV 10 2021 YES I AM"],
    [5, "JAN/20/2020 - WILL CALL IN DIRECTIONS"],
]

chk = pd.DataFrame(data, columns = ['id', 'strin'])

Required output:

1 2000-11-20
2 
3 
4 2021-11-10
5 2020-01-20

Not the most pleasant solution, but this 'works'.

I've only added month-to-month-number mappings for the two months (JAN and NOV) that appear in full dates in your sample data, as I can't guess what your other mappings might be. There's also no reason you can't add multiple mappings for each month, e.g. 'FBR': 2, 'FEB': 2.

Sample data setup:
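If the remaining months follow the usual three-letter English abbreviations, the mapping could be filled in along these lines. This is only a sketch: the abbreviation list and the extra 'FBR' alias are assumptions, not something taken from your data.

```python
# Hypothetical full mapping: assumes standard three-letter English abbreviations.
abbreviations = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
monthMap = {abbr: number for number, abbr in enumerate(abbreviations, start=1)}

# Several spellings can point at the same month number (assumed alias).
monthMap["FBR"] = 2

# The alternation used inside the date regex is built from the keys.
monthRegex = "|".join(monthMap)
```

Adding DEC to the mapping does not by itself make "DEC 20 I am going to shopping" parse, because that row still has no year.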

import pandas as pd

data = [
    [1, "NOV 20/00 I have a date"],
    [2, "DEC 20 I am going to shopping"],
    [3, "I done with all the things"],
    [4, "NOV 10 2021 YES I AM"],
    [5, "JAN/20/2020 - WILL CALL IN DIRECTIONS"],
    [6, "JAN/7/2020 This has single-digit day"]
]

chk = pd.DataFrame(data, columns = ['id', 'text'])

Note that I called the second column text, as it made more sense to me than strin. I also added an extra row to test single-digit day matches.

Code to parse the dates where possible:

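import datetime

# Month abbreviation -> month number; only the months needed for the sample data.
monthMap = {"JAN": 1, "NOV": 11}
monthRegex = "|".join(monthMap.keys())
separatorRegex = "[ /]"
# Raw f-string so that \d reaches the regex engine unchanged.
dateRegex = rf"^({monthRegex}){separatorRegex}(\d\d?){separatorRegex}(\d\d\d?\d?)"

def mapMatchesToDate(matchRow):
    # matchRow is the list returned by findall: empty, or [(month, day, year)].
    if matchRow:
        month, day, year = matchRow[0]
        month = monthMap[month]
        # Two-digit years are assumed to be in the 2000s.
        year = "20" + year if len(year) == 2 else year
        return datetime.date(int(year), int(month), int(day))
    else:
        return None

chk['text'].str.findall(dateRegex).apply(mapMatchesToDate)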

This outputs a pandas Series (i.e. just a single column):

0    2000-11-20
1          None
2          None
3    2021-11-10
4    2020-01-20
5    2020-01-07

This could be assigned to a column in the original data with (e.g.):

chk['date'] = chk['text'].str.findall(dateRegex).apply(mapMatchesToDate)

Meaning that chk is now:

   id                                   text        date
0   1                NOV 20/00 I have a date  2000-11-20
1   2          DEC 20 I am going to shopping        None
2   3             I done with all the things        None
3   4                   NOV 10 2021 YES I AM  2021-11-10
4   5  JAN/20/2020 - WILL CALL IN DIRECTIONS  2020-01-20
5   6   JAN/7/2020 This has single-digit day  2020-01-07
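To see why mapMatchesToDate unpacks matchRow[0], it can help to run the same pattern outside pandas with the standard re module. The sketch below writes dateRegex out by hand for the two mapped months; the third test string is made up for illustration.

```python
import re

# Same pattern as dateRegex, expanded for the two mapped months.
date_regex = r"^(JAN|NOV)[ /](\d\d?)[ /](\d\d\d?\d?)"

# A full date at the start of the string gives one (month, day, year) tuple.
full_match = re.findall(date_regex, "NOV 20/00 I have a date")
print(full_match)  # [('NOV', '20', '00')]

# An unmapped month gives an empty list, which is falsy.
unmapped = re.findall(date_regex, "DEC 20 I am going to shopping")
print(unmapped)  # []

# A date that is not at the start is skipped because of the ^ anchor.
not_at_start = re.findall(date_regex, "Call on JAN/20/2020")
print(not_at_start)  # []
```

The empty list is what sends a row down the else branch and produces None.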

Problems/notes

  1. This is not efficient pandas code, and it makes very little use of any optimised pandas functionality.
  2. Three-digit numbers in the 'year' position are not handled sensibly: e.g. NOV 1 123 produces a date in the year 123.
  3. Any value in the 'year' position with only two digits is assumed to be in the 2000s, so 17 is 2017 and 99 is 2099, not 1999.
  4. Dates are only matched if they are the very first thing in the string. (This is controlled by the ^ character at the start of dateRegex.)
  5. No attempt is made to match partial dates (e.g. DEC 20 in your sample data).
  6. Single-digit days (e.g. "1") are allowed (shown by the extra final row in the sample data). Single-digit years are not.
  7. The only separator characters allowed between date components are space and /. These are defined in separatorRegex.
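On the efficiency point, one possible alternative is to let pandas do the matching and parsing column-wise with str.extract and pd.to_datetime. This is a rough sketch rather than a drop-in replacement: the names month_map, pattern, parts and iso are made up here, the month values are stored as zero-padded strings for convenience, and two-digit years are still assumed to be in the 2000s.

```python
import pandas as pd

chk = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "text": [
        "NOV 20/00 I have a date",
        "DEC 20 I am going to shopping",
        "I done with all the things",
        "NOV 10 2021 YES I AM",
        "JAN/20/2020 - WILL CALL IN DIRECTIONS",
        "JAN/7/2020 This has single-digit day",
    ],
})

# Assumed mapping, stored as zero-padded strings so the parts can be joined.
month_map = {"JAN": "01", "NOV": "11"}
month_regex = "|".join(month_map)
pattern = rf"^(?P<month>{month_regex})[ /](?P<day>\d{{1,2}})[ /](?P<year>\d{{2,4}})"

# One column per named group; rows without a match are NaN throughout.
parts = chk["text"].str.extract(pattern)

# Two-digit years get a "20" prefix; other lengths are left alone.
year = parts["year"].str.replace(r"^(\d{2})$", r"20\1", regex=True)
month = parts["month"].map(month_map)
day = parts["day"].str.zfill(2)

# Join into YYYY-MM-DD strings; missing parts stay missing.
iso = year.str.cat([month, day], sep="-")
chk["date"] = pd.to_datetime(iso, format="%Y-%m-%d", errors="coerce")
```

Unlike the apply-based version, the date column here holds pandas timestamps, with NaT rather than None for rows that did not match.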
