
Extract and sort dates from Python dataframe using regex

I am working with a single-column pandas DataFrame consisting of thousands of rows of strings. Each string may contain date information in different formats, for instance:

05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015

To use a couple of strings as examples:

df[0]  he plans to depart on 6/12/95
df[1]  as of Mar. 23rd, 2011, the board decides that...
df[2]  the 12-28-01 record shows...

I would like to call findall() on df, so that df.str.findall(r'') extracts the date elements:

[0]  6/12/95
[1]  Mar. 23rd, 2011
[2]  12-28-01

from the original strings, followed by some sort command that ranks the extracted dates in chronological order by their indices, so that the output looks like

[0]  1
[1]  3
[2]  2

I (tentatively) use the following call

df.str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')

but have no clue how to deal with

(1) the ordinal indicator after digits: st, th, nd

(2) the occasional "." marking an abbreviation, and

(3) slash (/) and hyphen (-)

using a single regex findall call in one go.

Also, once the extraction is done, I want to sort the dates in chronological order along with their respective indices (i.e., 1, 2, 3, ..., n). But my current knowledge of regex is not enough to understand how Python can sort these different date formats chronologically.

I would really appreciate it if someone could show me some handy tricks with .findall() for this, or explain how to sort date expressions.

dateutil.parser.parse could help you avoid regex, which is surely a good thing here.

It takes a string and tries to parse it into a datetime object, which is convenient because datetime objects can be sorted easily.

from dateutil.parser import parse

data = """05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015"""

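# Split the sample into individual date strings, dropping empty pieces
data = [s.strip() for s in data.replace('\n', ';').split(';') if s.strip()]

dates = []
for line in data:
    try:
        dates.append(parse(line))
    except (ValueError, OverflowError):
        # parse() raises ValueError when the string is not a recognizable date
        continue

# datetime objects compare chronologically, so sorted() is enough
print(sorted(dates))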

Truncated output (fields missing from a string, such as the day in "Feb 2001", are filled in from the current date):

[datetime.datetime(2001, 2, 4, 0, 0), datetime.datetime(2001, 3, 12, 0, 0), datetime.datetime(2001, 3, 25, 0, 0), datetime.datetime(2001, 3, 25, 0, 0) ...]

As you can see, you win on two points:

  1. It's really easy to sort datetime objects.
  2. You don't have to trust a long, complex regex pattern to decide whether a string is a date; parse does it for you.
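To get the chronological rank per index that the question asks for, the parsed datetimes can be ranked rather than just sorted. The following is a minimal sketch, assuming the three date strings have already been extracted; the names `extracted`, `DEFAULT` and `ranks` are illustrative only, and the fixed `default` is an assumption added so that missing fields do not depend on the day the code is run.

```python
from datetime import datetime
from dateutil.parser import parse

# Date strings as they would come out of the extraction step (from the question)
extracted = ["6/12/95", "Mar. 23rd, 2011", "12-28-01"]

# Assumption: a fixed default so missing fields are not taken from today's date
DEFAULT = datetime(1900, 1, 1)
parsed = [parse(s, default=DEFAULT) for s in extracted]

# Indices sorted by date, then turned into a 1-based rank for each index
order = sorted(range(len(parsed)), key=parsed.__getitem__)
ranks = [0] * len(parsed)
for position, idx in enumerate(order, start=1):
    ranks[idx] = position

for idx, rank in enumerate(ranks):
    print(f"[{idx}] {rank}")
```

Note that dateutil reads ambiguous numeric dates such as 6/12/95 as month/day/year unless `dayfirst=True` is passed.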

I would try the following two modules. dateutil, as used in this answer:

Extracting date from a string in Python

and/or dateparser:

https://dateparser.readthedocs.io/en/latest/
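As a rough illustration of the dateutil route applied to whole sentences, `parse` accepts `fuzzy=True`, which skips tokens it does not recognize. The sketch below assumes one date per string and uses the three example sentences from the question; the helper name `to_date` and the fixed `default` value are assumptions made for this example, not part of either library.

```python
from datetime import datetime

import pandas as pd
from dateutil.parser import parse

df = pd.Series([
    "he plans to depart on 6/12/95",
    "as of Mar. 23rd, 2011, the board decides that...",
    "the 12-28-01 record shows...",
])

def to_date(text):
    # fuzzy=True ignores words that are not part of a date;
    # the fixed default (an assumption) keeps missing fields reproducible
    return parse(text, fuzzy=True, default=datetime(1900, 1, 1))

dates = df.apply(to_date)

# 1-based chronological rank, aligned with the original index
ranks = dates.rank(method="first").astype(int)
print(ranks)
```

Fuzzy parsing can misread strings that contain several numbers, so it is worth spot-checking the parsed dates against the original text on real data.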

Try this:

r'(?:\d{1,2}[][/-])?(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)?(?:\d{1,2}[/-])?\d{2,4}'
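For comparison, a single pattern covering the three cases raised in the question (ordinal suffixes, the abbreviation period, and slash or hyphen separators) could be sketched as below. This is not the pattern from any answer here; `MONTH`, `ORDINAL` and `DATE_RE` are names made up for the example, and the alternatives are ordered from most to least specific so that a full date wins over a bare year.

```python
import re

# Building blocks (hypothetical names): month name with optional trailing period,
# and an optional ordinal suffix after the day number
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
ORDINAL = r"(?:st|nd|rd|th)?"

DATE_RE = re.compile(
    rf"""
    \b(?:
        \d{{1,2}}[/-]\d{{1,2}}[/-]\d{{2,4}}            # 6/12/95, 12-28-01
      | {MONTH}[ -]\d{{1,2}}{ORDINAL},?[ -]\d{{4}}     # Mar. 23rd, 2011 or May-10-2001
      | \d{{1,2}}[ ]{MONTH},?[ ]\d{{4}}                # 25 Mar. 2001
      | {MONTH},?[ ]\d{{4}}                            # Feb 2001
      | \d{{1,2}}/\d{{4}}                              # 5/2001
      | \d{{4}}                                        # 2001 (also matches any 4-digit number)
    )\b
    """,
    re.VERBOSE,
)

samples = [
    "he plans to depart on 6/12/95",
    "as of Mar. 23rd, 2011, the board decides that...",
    "the 12-28-01 record shows...",
]
for text in samples:
    print(DATE_RE.findall(text))
```

On a pandas Series the same pattern should be usable as `df.str.findall(DATE_RE.pattern, flags=re.VERBOSE)`. The extracted strings still need a parser such as dateutil before they can be sorted chronologically.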

Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you republish, please cite this site's URL or the original address.

 