
How to extract date from (relatively) unstructured text [R]

I'm having difficulty extracting dates from a string. The string can look one of several ways, but will always include some form of:

<full month name> <numeric date>, <year>

As in:

DECEMBER 4, 2011

However, the text at the beginning of the string ranges widely, taking forms like all of these:

THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL JUNE 9, 2011
THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL FOR OCTOBER 29 & OCTOBER 30, 2011
The Public Schedule for Mayor Rahm Emanuel December 17, 2011 through January 2, 2012
The Public Schedule for Mayor Rahm Emanuel December 8th and 9th, 2012
The Public Schedule for Mayor Rahm Emanuel – March 13, 2013

These variations are really throwing me off. Ordinarily, I would just get rid of the first X characters of the string, and use the remainder as my date, but because the formatting keeps changing this isn't possible. I have been attempting variations of this, but I end up creating dates with just as many problems.

It seems like grep() might be the function to use here, but I don't really understand how I could create a pattern which would capture these dates, or how to use its output.

Thank you for any help!

This is more or less just a heuristic. If you remove everything up to the month, you get something more manageable. Let's assume your example lines are in a variable b:

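# Alternation of the twelve English month names: "January|February|...|December"
months.regex <- paste(month.name, collapse = "|")
# Keep only lines containing a month name, then drop everything before the last one
d <- gsub(paste0(".*(", months.regex, ")"), "\\1",
          b[grep(months.regex, b, ignore.case = TRUE)], ignore.case = TRUE)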

This keeps only the lines that contain a month name and removes everything before the last month name in each line:

> d
[1] "JUNE 9, 2011"               "OCTOBER 30, 2011"          
[3] "January 2, 2012"            "December 8th and 9th, 2012"
[5] "March 13, 2013"            
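
For anyone who wants to reproduce this step, here is a minimal self-contained sketch. It assumes b holds exactly the five example lines from the question; the dash in the last line is written as the escape \u2013 so the script stays ASCII-only.

```r
# Assumed input: the five example lines from the question.
b <- c(
  "THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL JUNE 9, 2011",
  "THE PUBLIC SCHEDULE FOR MAYOR RAHM EMANUEL FOR OCTOBER 29 & OCTOBER 30, 2011",
  "The Public Schedule for Mayor Rahm Emanuel December 17, 2011 through January 2, 2012",
  "The Public Schedule for Mayor Rahm Emanuel December 8th and 9th, 2012",
  "The Public Schedule for Mayor Rahm Emanuel \u2013 March 13, 2013"
)

# "January|February|...|December"
months.regex <- paste(month.name, collapse = "|")

# Lines that mention a month name at all (case-insensitive).
with.month <- b[grep(months.regex, b, ignore.case = TRUE)]

# The greedy ".*" swallows everything up to the LAST month name in each line.
d <- gsub(paste0(".*(", months.regex, ")"), "\\1", with.month,
          ignore.case = TRUE)
print(d)
```

One thing to watch with this particular data: "May" also occurs inside "Mayor", so the grep() filter lets every line through, including any line that carries no date at all. Checking the extracted month for NA afterwards is probably a sensible safeguard.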

The month and year are reasonably easy to extract:

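# First word of each string, matched case-insensitively against month.name
month <- match(tolower(gsub("\\s.*", "", d)), tolower(month.name))
# Everything between the month and the comma (may still be free-form text)
day <- gsub("\\S+\\s+(.*),.*", "\\1", d)
# Four-digit year after the comma
year <- as.integer(gsub(".*,\\s*(\\d{4})", "\\1", d))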
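
To see what these three lines produce, here is a small self-contained run. It assumes d is exactly the vector printed earlier; the comments show the values I would expect for that input.

```r
# Assumed input: the vector d as printed in the previous step.
d <- c("JUNE 9, 2011", "OCTOBER 30, 2011", "January 2, 2012",
       "December 8th and 9th, 2012", "March 13, 2013")

month <- match(tolower(gsub("\\s.*", "", d)), tolower(month.name))
day   <- gsub("\\S+\\s+(.*),.*", "\\1", d)
year  <- as.integer(gsub(".*,\\s*(\\d{4})", "\\1", d))

print(month)  # 6 10 1 12 3
print(day)    # "9" "30" "2" "8th and 9th" "13"
print(year)   # 2011 2011 2012 2012 2013
```

Month and year come out as clean integers, while day is still a character vector and the fourth element holds free-form text rather than a single number.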

The real problem is the free-form days and multiple dates. There is no perfect way - the above will always pick the last date if there is more than one month in the line. To reduce multiple days to a single one, you could use something like

day <- as.integer(gsub("\\D.*", "", day))

which will pick the first day if there is more than one. The full result is then:

> paste(month.name[month], day, year)
[1] "June 9 2011"     "October 30 2011" "January 2 2012"  "December 8 2012"
[5] "March 13 2013"  
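
If you need real Date objects rather than strings (for sorting or date arithmetic), one possible follow-up is to assemble an ISO-style string from the three integer vectors and parse it with an explicit format. This is only a sketch: the vectors below are typed in by hand from the result above, and the name dates is just an illustration.

```r
# Values taken from the result above (month, day and year as integers).
month <- c(6L, 10L, 1L, 12L, 3L)
day   <- c(9L, 30L, 2L, 8L, 13L)
year  <- c(2011L, 2011L, 2012L, 2012L, 2013L)

# Build "YYYY-MM-DD" strings; numeric months avoid locale-dependent month names.
iso <- sprintf("%04d-%02d-%02d", year, month, day)

# An explicit format makes impossible dates come back as NA instead of an error.
dates <- as.Date(iso, format = "%Y-%m-%d")
print(dates)
```

Going through numeric months rather than as.Date() with "%B" sidesteps locale issues, since "%B" only recognises month names in the current locale's language.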
