
Extract and sort dates from Python dataframe using regex

I am working with a single-column pandas data frame that consists of thousands of rows of string expressions. Each string may contain date information in different formats, for instance:

05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015

To use a couple of strings as examples:

 df[0] he plans to depart on 6/12/95
 df[1] as of Mar. 23rd, 2011, the board decides that...
 df[2] the 12-28-01 record shows...

I would like to use a findall() function after df, such that df.str.findall(r'') extracts date elements:

 [0] 6/12/95
 [1] Mar. 23rd, 2011
 [2] 12-28-01

from the original strings, followed by some sort command that orders the extracted dates chronologically and reports each row's rank against its index, so that the output looks like

 [0] 1
 [1] 3
 [2] 2

I (tentatively) use the following function

 df.str.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}')

but have no clue as to how to deal with

(1) ordinal indicator after digits: st, th, nd

(2) the occasional "." values representing abbreviation, and

(3) slash (/) and hyphen (-)

using the regex findall function in one go.

Also, after all the extraction work is done, I want to sort the dates in chronological order with their respective indices (i.e., 1, 2, 3, ..., n). But my current knowledge of regex is insufficient to know how Python can sort these different date formats in chronological order.

It would be really appreciated if someone could enlighten me with some handy tricks for the .findall() function here, or explain the mechanism for sorting date expressions.

dateutil.parser.parse could help you avoid regex; it's surely a good thing to do here.

It basically takes a string and tries to parse it into a datetime object, which is great because datetime objects can be sorted easily.
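As a small illustration of that idea, the three dates from the question's example rows can be parsed and ranked chronologically. This is only a sketch: the names `extracted`, `order`, and `ranks` are illustrative, the strings are assumed to be already isolated from their sentences, and ambiguous numeric dates such as 6/12/95 are read month-first, which is dateutil's default.

```python
from dateutil.parser import parse

# Date substrings from the question's example, assumed already extracted
extracted = ["6/12/95", "Mar. 23rd, 2011", "12-28-01"]

# parse() returns datetime objects, which compare chronologically
parsed = [parse(text) for text in extracted]

# Row positions ordered from the earliest date to the latest
order = sorted(range(len(parsed)), key=lambda i: parsed[i])

# Chronological rank (1 = earliest) for each row, kept in the original row order
ranks = [0] * len(parsed)
for rank, row in enumerate(order, start=1):
    ranks[row] = rank

print(ranks)
```

With the month-first reading this should print `[1, 3, 2]`, the ranking asked for in the question.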

from dateutil.parser import parse

data = """05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015"""

# Split the data into a list of candidate strings
data = data.replace('\n', ';').split(';')

dates = []
for line in data:
    try:
        dates.append(parse(line))
    except (ValueError, OverflowError):
        # not parsable (for example, an empty string)
        pass

print(sorted(dates))

Truncated output:

[datetime.datetime(2001, 2, 4, 0, 0), datetime.datetime(2001, 3, 12, 0, 0), datetime.datetime(2001, 3, 25, 0, 0), datetime.datetime(2001, 3, 25, 0, 0) ...]

As you can see you win on two points:

  1. It's really easy to sort datetime objects
  2. You don't have to trust any long and complex regex pattern to decide whether a string is a date; parse does it for you
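One caveat when sorting partial dates such as `Feb 2001` or `2001`: `parse` takes any missing fields from its `default` argument, which falls back to the current date at midnight. That is presumably why the output above contains `datetime.datetime(2001, 2, 4, 0, 0)`, with a day the input never mentioned. A minimal sketch, assuming that filling missing fields from 1 January is acceptable for ordering (the name `FILL` and its year 1900 are arbitrary placeholders):

```python
from datetime import datetime
from dateutil.parser import parse

# Arbitrary placeholder; only fields missing from the input are taken from it
FILL = datetime(1900, 1, 1)

month_only = parse("Feb 2001", default=FILL)
year_only = parse("2001", default=FILL)

print(month_only)  # 2001-02-01 00:00:00
print(year_only)   # 2001-01-01 00:00:00
```

Passing a fixed `default` keeps the sort order reproducible, since it no longer depends on the day the script is run.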

I would try using the following two modules: dateutil, as described in this answer:

Extracting date from a string in Python

and/or dateparser:

https://dateparser.readthedocs.io/en/latest/

Try this: """(r'(?:\d{1,2}[][/-])?(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)?(?:\d{1,2}[/-])?\d{2,4}')"""
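For the regex route, here is a hedged sketch that covers the three points raised in the question (ordinal suffixes, the abbreviation dot, and slash or hyphen separators) by joining one alternative per date shape. The names `MONTH`, `ALTERNATIVES`, and `DATE_PATTERN` are illustrative, the pattern has only been checked against the question's example strings, and the bare four-digit alternative will also match numbers that are not years.

```python
import re

# Month name or abbreviation with an optional trailing dot ("Mar", "March", "Mar.")
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?'

# Tried in this order at each position, so fuller forms win over a bare year
ALTERNATIVES = [
    r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}',                   # 6/12/95, 12-28-01
    MONTH + r'[ -]\d{1,2}(?:st|nd|rd|th)?,?[ -]\d{4}',  # Mar. 23rd, 2011; May-10-2001
    r'\d{1,2} ' + MONTH + r',? \d{4}',                  # 25 Mar. 2001; 25 March, 2001
    MONTH + r' \d{4}',                                  # Feb 2001
    r'\d{1,2}/\d{4}',                                   # 5/2001
    r'\d{4}',                                           # 2001
]
DATE_PATTERN = '|'.join(ALTERNATIVES)

rows = [
    "he plans to depart on 6/12/95",
    "as of Mar. 23rd, 2011, the board decides that...",
    "the 12-28-01 record shows...",
]

# One list of matches per row; no capturing groups, so whole matches are returned
found = [re.findall(DATE_PATTERN, row) for row in rows]
print(found)
# [['6/12/95'], ['Mar. 23rd, 2011'], ['12-28-01']]
```

With pandas, the same pattern string can be passed to `Series.str.findall`, and the extracted substrings can then be handed to a date parser for sorting.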
