
Regular expression for dates in R

I am trying to create a regular expression in R that will search for dates within some text. Since I cannot control the actual date format, I am trying to "catch" all the possible dd/mm/yy formats (one or two digit months, two or four digit years, optional 1 or two digit days, with a range of separators ("/", "-", "."), possibly containing spaces).

My regular expression so far is:

pattern = "(\\d{0,2}[/\\.-])?[ ]?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"

This seems to work on most formats, but it contains a bug that I find hard to understand:

str_extract_all("09/11 /1985", pattern = pattern) # returns: "09/11 /1985"
str_extract_all(" 09/11 /1985", pattern = pattern) # returns: c("09/11",  "1985")

This sounds extremely weird. Since I am not including lookarounds, the extra space in the start should make no difference. The results say otherwise. What am I doing wrong?

The problem lies in the first part of your regex, where you presumably try to match the day: (\\d{0,2}[/\\.-])?[ ]? It optionally matches 0 to 2 digits followed by one of your delimiters, and then optionally matches a space.

In the case of " 09/11 /1985", the optional group matches nothing and [ ]? matches the leading space, leaving 09 to be matched as the month and 11 as the year.

To get rid of this behaviour, move the space into the optional group. You might also want to require 1 or 2 digits; otherwise the group can match a leading delimiter on its own.

So I would rewrite this first part as (\\d{1,2}[/\\.-][ ]?)?
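The difference can be checked directly. The following is a minimal sketch in base R, using regmatches()/gregexpr() with perl = TRUE instead of stringr so that it runs without extra packages. The month alternatives are abbreviated with R's built-in month.name and month.abb constants plus "Febr" and "Sept", which is a simplification of the list in the question, and extract_all is just a helper name chosen for this sketch.

```r
# Month alternatives: a shortened stand-in for the long list in the question.
months <- paste(c(month.name, "Febr", "Sept", month.abb), collapse = "|")
rest <- paste0("(\\d{1,2}[ ]*[/\\.-]|(?:", months, ")\\.?)[ ]*[']?\\d{2,4}")

original <- paste0("(\\d{0,2}[/\\.-])?[ ]?", rest)  # space outside the optional group
fixed    <- paste0("(\\d{1,2}[/\\.-][ ]?)?", rest)  # space inside, at least one digit

# Helper (name chosen for this sketch): all matches of a pattern in one string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

res_original <- extract_all(" 09/11 /1985", original)
res_fixed    <- extract_all(" 09/11 /1985", fixed)

print(res_original)  # the first match starts at the leading space and ends after "11"
print(res_fixed)     # the whole date is matched in one piece
```

With the original first part, the match begins at the leading space, so 09 is consumed as the month. With the rewritten first part, nothing can match at the space, and the engine starts over at 09 as the day.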

There are a few other points you could improve, e.g.:

  • January|Jan|Jan\\. is the same as Jan(?:\\.|uary)?
  • consider using non-capturing groups
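As a small sanity check of the first point, here is a sketch that compares both spellings on a few invented test strings. Both patterns are anchored so that the whole string has to match; without anchors or a following pattern, the order of the alternatives would matter, because "Jan" would win before "Jan\\." is tried.

```r
# Invented test strings: three that should match, two that should not.
x <- c("January", "Jan", "Jan.", "Janu", "Jan..")

long_form  <- grepl("^(?:January|Jan|Jan\\.)$", x, perl = TRUE)
short_form <- grepl("^Jan(?:\\.|uary)?$", x, perl = TRUE)

print(long_form)
print(short_form)
print(identical(long_form, short_form))  # both spellings accept the same strings
```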

I think the best thing would be to know the date format used in the given string before reading the file, and then to test whether the date format is always as expected. However, as the OP states, this is not the case here. Even a non-exhaustive list of date formats should give you an impression of how tedious it can be to figure out a regex that only allows valid dates. Also, format guessing can make your scripts somewhat unpredictable for someone who does not understand in detail how the guessing is done.

If you still think you need to use a regex for different date formats, try to design it in a way that makes it clear to the reader which format is given priority:

(?:format1)|(?:format2)|...|(?:formatN)

In this case format1 would have priority over the formats that follow it.
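A tiny sketch of how the order of alternatives decides the outcome, using two deliberately simple stand-in formats (a four-digit and a two-digit number) and a helper name, extract_all, chosen for this example:

```r
# Helper (name chosen for this sketch): all matches of a pattern in one string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Four-digit alternative listed first: it is tried first at every position.
four_first <- extract_all("1985", "(?:\\d{4})|(?:\\d{2})")

# Two-digit alternative listed first: it wins, so the year is split in two.
two_first <- extract_all("1985", "(?:\\d{2})|(?:\\d{4})")

print(four_first)  # "1985"
print(two_first)   # "19" "85"
```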

There are also quite nice regexes at https://stackoverflow.com/a/15504877/6018688 that check date validity, even accounting for leap years, for the formats dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy.

^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[1,3-9]|1[0-2])\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$
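To see what this validity check buys you, the regex above can be tried on a few dates. This is only a sketch: the pattern is copied as written above (backslashes doubled, as an R string literal requires), split over three strings for readability, and the test dates are invented for the example.

```r
# The validity-checking regex from above, one string per top-level alternative.
valid_date <- paste0(
  "^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[1,3-9]|1[0-2])\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
  "|^(?:29(\\/|-|\\.)0?2\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$",
  "|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$"
)

dates <- c(
  "29/02/2016",  # leap year
  "29/02/2015",  # not a leap year
  "31/04/2016",  # April has 30 days
  "31/12/1999",  # ordinary valid date
  "01-01.2016"   # mixed separators
)

is_valid <- grepl(valid_date, dates, perl = TRUE)
print(setNames(is_valid, dates))
```

Only the first and the fourth date are accepted. The backreferences (\\1 to \\4) are what reject the mixed separators in the last one.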

And from the same question, a different answer that also accepts month names:

^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)(?:0?2|(?:Feb))\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$

I think you get an impression now of how convoluted it can be to write a regex that does exactly what you intend. I would really try to keep the allowed dates to a minimum and aim for a quite restrictive regex. In your example, the strings contain only dates (and spaces), nothing else. If this is also the case for your real input, you should try to match the whole string with "^yourregex$"; if you want to allow for spaces at the beginning and end of the string, use "^\\s*yourregex\\s*$". Since one of your examples has spaces at the beginning of the string, I use the latter for further development.
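As a quick illustration of what the anchors change, here is a sketch that uses a plain four-digit number as a stand-in for yourregex and a handful of invented inputs:

```r
x <- c("1985", "  1985 ", "19855", "ID1985A", "year 1985")

# Anchored: the whole string, apart from surrounding whitespace, must be the year.
whole_string <- grepl("^\\s*(?:\\d{4})\\s*$", x, perl = TRUE)

# Unanchored: four consecutive digits anywhere in the string are enough.
anywhere <- grepl("\\d{4}", x, perl = TRUE)

print(data.frame(x, whole_string, anywhere))
```

Only the first two inputs pass the anchored version, while the unanchored one accepts all five.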

In your case I would start with only years:

"^\\s*(?:\\d{4})\\s*$"

Then allow numeric formats such as mm-dd-YY. This does not check whether the result is a valid date, so "33-13-2016" would match, and it also allows a two-digit year:

"(?:\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{4}|\\d{2}))"

And if you want to allow spaces around the delimiters:

"(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})"

Then formats with written or abbreviated month names:

"(\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4}))"

Put together (the alternatives are wrapped in one group so that ^\\s* and \\s*$ apply to all of them, not only to the first and the last):

"^\\s*(?:(?:\\d{4})|(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})|(?:\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4})))\\s*$"

This way you can chain as many formats as you wish.
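One way to keep such a chain readable is to store each format separately and assemble the full pattern programmatically. The sketch below assumes base R with perl = TRUE; the names formats, month_alt and full are chosen for this example, and the month list is shortened with R's built-in month.name and month.abb constants plus "Febr" and "Sept".

```r
# Shortened month list (an assumption of this sketch, not the full list above).
month_alt <- paste(c(month.name, "Febr", "Sept", month.abb), collapse = "|")

# One entry per format; earlier entries take priority over later ones.
formats <- c(
  year_only = "\\d{4}",
  numeric   = "\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*(?:\\d{4}|\\d{2})",
  named     = paste0("\\d{1,2}\\s*[/.-]?\\s*(?:", month_alt,
                     ")\\.?\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4})")
)

# Wrap every format in its own group, join with "|", then anchor the whole thing.
full <- paste0("^\\s*(?:", paste0("(?:", formats, ")", collapse = "|"), ")\\s*$")

x <- c("1985", " 2/1/2016 ", "2. Jan. 2016", "2. Jan. '16",
       "33-13-2016", "19855", "ID1985A")
matched <- grepl(full, x, perl = TRUE)
print(setNames(matched, x))
```

Note that "33-13-2016" is accepted: the pattern checks the shape of the string, not whether the date exists.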

Please compare the following regex with yours to check the behavior on different input strings. I added word boundary (\\b) constraints; since you used str_extract_all, I assume there can be multiple dates in the same string.

library(stringr)

string = "only a year 1985. No space 2.Jan.2016. 2. Jan. 2016. 2. Jan. '16 2/1/16 02/01/2016 19855 ID1985A 2. Jan 2016   2.. Jan 2016 1January2016 2-Jan.-2016 2-Jan-2016 2.\tJan.\t2016"
# the pattern from the question, with the corrected first group
pattern = "(\\d{1,2}[/\\.-][ ]?)?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"
# a stricter alternative with word boundaries and one group per format
p="\\s*(?:\\b\\d{4}\\b)|(?:\\b\\d{1,2}\\s*[/\\.-]\\s*\\d{1,2}\\s*[/\\.-]\\s*(?:\\d{4}|\\d{2})\\b)|\\b\\d{1,2}\\s*[/\\.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|(?:Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec)\\.?)\\s*[/\\.-]?\\s*(?:\\d{4}|'?\\d{2})\\b\\s*"
str_extract_all(string, pattern=pattern)
str_extract_all(string, pattern=p)
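The effect of the word boundaries can be isolated on the year-only part. This is a base R sketch with perl = TRUE and a shortened test string; extract_all is a helper name chosen for the example.

```r
# Helper (name chosen for this sketch): all matches of a pattern in one string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

text <- "only a year 1985. 19855 ID1985A"

# With \b the four digits must not touch another letter or digit.
with_boundary <- extract_all(text, "\\b\\d{4}\\b")

# Without \b the digits inside "19855" and "ID1985A" are picked up as well.
without_boundary <- extract_all(text, "\\d{4}")

print(with_boundary)
print(without_boundary)
```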

A word of warning: when you allow multiple variants of different formats with optional spaces, it becomes hard to guarantee that only dates are matched and not some other numeric values in the text.
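A short sketch of that risk, using the numeric branch of the pattern on an invented sentence that contains a version number and a spaced-out range, neither of which is a date:

```r
# Helper (name chosen for this sketch): all matches of a pattern in one string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

numeric_date <- "\\b\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*(?:\\d{4}|\\d{2})\\b"

# Invented example text without any dates in it.
text <- "build 1.2.16 took 3 - 4 - 10 seconds"

false_positives <- extract_all(text, numeric_date)
print(false_positives)  # both number groups look like dates to the pattern
```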

Escaping the dot inside a character class is unnecessary: [\\.] can simply be written as [.]. If you also want to allow a backslash as a delimiter, as in day\month\year, the backslash itself has to be escaped inside the class. When the input format is variable, the space can also be a tab (\\t), so replacing [ ] with \\s seems to be a good idea. Note that \\s matches any whitespace character, including line terminators such as \\n.
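Both points can be verified with a few one-liners. This is a base R sketch with perl = TRUE and invented inputs; note that "\\" in R source is a string holding a single backslash.

```r
inputs <- c(".", "a", "\\")

# The escaped and the unescaped dot behave identically inside a character class.
dot_plain   <- grepl("^[.]$", inputs, perl = TRUE)
dot_escaped <- grepl("^[\\.]$", inputs, perl = TRUE)

# [ ] only matches a literal space, \s also matches tabs and newlines.
space_class <- grepl("2\\.[ ]Jan", c("2. Jan", "2.\tJan"), perl = TRUE)
any_ws      <- grepl("2\\.\\sJan", c("2. Jan", "2.\tJan", "2.\nJan"), perl = TRUE)

print(rbind(dot_plain, dot_escaped))
print(space_class)
print(any_ws)
```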
