Regular expression for dates in R

I am trying to create a regular expression in R that will search for dates within some text. Since I cannot control the actual date format, I am trying to "catch" all the possible dd/mm/yy formats: an optional one- or two-digit day, a one- or two-digit month, a two- or four-digit year, and a range of separators ("/", "-", "."), possibly with spaces around them.

My regular expression so far is:

pattern = "(\\d{0,2}[/\\.-])?[ ]?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"

This seems to work on most formats, but it contains a bug that I find hard to understand:

str_extract_all("09/11 /1985", pattern = pattern) # returns: "09/11 /1985"
str_extract_all(" 09/11 /1985", pattern = pattern) # returns: c("09/11",  "1985")

This sounds extremely weird. Since I am not including lookarounds, the extra space at the start should make no difference. The results say otherwise. What am I doing wrong?

The problem lies in the first part of your regex, where you probably try to match the days: (\\d{0,2}[/\\.-])?[ ]? It optionally matches 0 to 2 digits followed by one of your delimiters. Then it optionally matches a space.

In the case of " 09/11 /1985" the optional group matches nothing and [ ]? consumes the leading space, leaving 09 to be matched as the month and 11 as the year.

To get rid of this behaviour, you should move the space into the optional group. You might also want to require 1 or 2 digits, otherwise the group can match a leading delimiter on its own.

So I would rewrite this first part to (\\d{1,2}[/\\.-][ ]?)?
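A minimal sketch of the difference between the two prefixes. It assumes base R with perl = TRUE in place of stringr, and it shortens the month list to two names for readability; first_match is a hypothetical helper, not something from the question.

```r
# Shared tail of the pattern; month alternatives shortened (assumption: the
# full list of month names does not change the behaviour shown here)
tail_part <- "(\\d{1,2}[ ]*[/\\.-]|January|Jan)[ ]*[']?\\d{2,4}"

original <- paste0("(\\d{0,2}[/\\.-])?[ ]?", tail_part)  # prefix from the question
fixed    <- paste0("(\\d{1,2}[/\\.-][ ]?)?", tail_part)  # prefix suggested above

# Hypothetical helper: return the first match of `pattern` in `x`
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

first_match(original, "09/11 /1985")   # the whole date
first_match(original, " 09/11 /1985")  # match stops after "11"
first_match(fixed,    " 09/11 /1985")  # the whole date again
```

With the original prefix, the match on the second string starts at the leading space, so the engine never gets the chance to treat 09 as the day.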

There are a few other points you could improve, e.g.:

  • January|Jan|Jan\\. is the same as Jan(?:\\.|uary)?
  • consider using non-capturing groups
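A quick way to convince yourself that the compact form is equivalent is to run both against a few inputs. This is only a sketch: the anchors and the test inputs are my own additions, and base R with perl = TRUE is assumed.

```r
# Both patterns are anchored so that only complete month tokens count
verbose <- "^(January|Jan|Jan\\.)$"
compact <- "^Jan(?:\\.|uary)?$"   # non-capturing group, same alternatives

inputs <- c("Jan", "Jan.", "January", "Janu", "Jan.uary")

verbose_hits <- grepl(verbose, inputs, perl = TRUE)
compact_hits <- grepl(compact, inputs, perl = TRUE)

# Side-by-side comparison of the two patterns
data.frame(input = inputs, verbose = verbose_hits, compact = compact_hits)
```

Both columns agree: the first three inputs match and the last two do not.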

I think the best thing would be to know the date format used in the given string before reading the file, and then test whether the date format is always as expected. However, as the OP states, this is not the case. Here is a non-exhaustive list of date formats; it should give you the impression that it can be tedious work to figure out a regex that only allows valid dates. Also, format guessing can make your scripts somewhat unpredictable for someone who does not understand in detail how the guessing is done.

If you still think you need to use regex for different date formats, try to design it in a way that makes it clear to the reader which format is given priority:

(?:format1)|(?:format2)|...|(?:formatN)

In this case format1 would have priority over the alternatives that follow it.
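The effect of the ordering can be seen with two formats that both match at the same position. This is a sketch with made-up formats (two-digit versus four-digit year); it assumes perl = TRUE, because a backtracking engine takes the first alternative that succeeds, whereas base R's default engine may prefer the longest match.

```r
x <- "01/02/2016"

# Same two alternatives, listed in opposite order
short_first <- "(?:\\d{1,2}/\\d{1,2}/\\d{2})|(?:\\d{1,2}/\\d{1,2}/\\d{4})"
long_first  <- "(?:\\d{1,2}/\\d{1,2}/\\d{4})|(?:\\d{1,2}/\\d{1,2}/\\d{2})"

short_hit <- regmatches(x, regexpr(short_first, x, perl = TRUE))
long_hit  <- regmatches(x, regexpr(long_first,  x, perl = TRUE))

short_hit  # "01/02/20": the two-digit-year format wins and truncates the year
long_hit   # "01/02/2016"
```

Putting the more specific format first is usually the safer choice.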

There are also quite nice regexes at https://stackoverflow.com/a/15504877/6018688 that do some date validity checking, even accounting for leap years, for the formats dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy.

^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[1,3-9]|1[0-2])\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)0?2\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9])|(?:1[0-2]))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$

and from the same question, a different answer with month names:

^(?:(?:31(\\/|-|\\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\\1|(?:(?:29|30)(\\/|-|\\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\\2))(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$|^(?:29(\\/|-|\\.)(?:0?2|(?:Feb))\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\\d|2[0-8])(\\/|-|\\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$
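As a contrast to validating inside the regex, one could match loosely and leave the calendar check to a date parser. This is a swapped-in technique, not what the linked answers do, and the sketch assumes the strings are known to be dd-mm-YYYY; the variable names are hypothetical.

```r
x <- c("02-01-2016", "33-13-2016", "29-02-2015")

# Loose shape check: digits and dashes only, no calendar logic
shape <- "^\\d{1,2}-\\d{1,2}-\\d{4}$"
looks_like_date <- grepl(shape, x, perl = TRUE)

# Calendar check: as.Date() yields NA for strings that are not real dates
# (assumption: day-month-year order)
is_valid <- !is.na(as.Date(x, format = "%d-%m-%Y"))

data.frame(x, looks_like_date, is_valid)
```

All three strings pass the shape check, but only the first is an actual date; 2015 was not a leap year.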

I think you now get an impression of how convoluted it can be to write a regex that does exactly what you intend. I would really try to keep the allowed date formats to a minimum and aim for a quite restrictive regex. In your example, you give strings containing only dates (and spaces), nothing else. If this is always the case, you should try to match the whole string with "^yourregex$"; if you want to allow for spaces at the beginning and end of the string, use "^\\s*yourregex\\s*$". Since you have one example with spaces at the beginning of the string, I use the latter for further development.
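A small sketch of the two anchoring styles, assuming base R with perl = TRUE. The variable core is a hypothetical stand-in for "yourregex"; it is wrapped in (?:...) so that any alternation inside it cannot escape the anchors.

```r
# Hypothetical stand-in for "yourregex"
core <- "\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}"

strict  <- paste0("^(?:", core, ")$")            # nothing but the date
lenient <- paste0("^\\s*(?:", core, ")\\s*$")    # surrounding whitespace allowed

x <- c("09/11/1985", " 09/11/1985 ", "ID 09/11/1985")

strict_hits  <- grepl(strict,  x, perl = TRUE)
lenient_hits <- grepl(lenient, x, perl = TRUE)

strict_hits   # TRUE FALSE FALSE
lenient_hits  # TRUE TRUE FALSE
```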

In your case I would start with only years:

"^\\s*(?:\\d{4})\\s*$"

Then allow the other stuff, mm-dd-YY (there is no checking whether it is indeed a valid date, so "33-13-2016" would pass; a 2-digit year is also allowed):

"(?:\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{4}|\\d{2}))"

and if you want to allow space between the delimiters:

"(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})"

Then formats with written or abbreviated month names:

"(\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4}))"

Put together (the alternatives are wrapped in one group so that ^\\s* and \\s*$ apply to every format, not only to the first and the last one):

"^\\s*(?:(?:\\d{4})|(?:\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4})|(?:\\d{1,2}\\s*[/.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)\\s*[/.-]?\\s*(?:'?\\d{2}|\\d{4})))\\s*$"

This way you can chain as many formats as you wish.
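One way to keep such a chain readable is to store each format under a name and assemble the alternation programmatically. This is a sketch under assumptions: the format names are hypothetical, the month list is shortened to January and February, and base R with perl = TRUE is used.

```r
# Named building blocks; earlier entries take priority in the alternation
formats <- c(
  year_only  = "\\d{4}",
  numeric    = "\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*\\d{4}",
  month_name = paste0(
    "\\d{1,2}\\s*[/.-]?\\s*",
    "(?:Jan(?:\\.|uary)?|Feb(?:\\.|ruary)?)",   # shortened month list
    "\\s*[/.-]?\\s*(?:\\d{4}|'?\\d{2})"
  )
)

# Wrap every format in its own group, join with "|", then anchor the whole
combined <- paste0(
  "^\\s*(?:", paste0("(?:", formats, ")", collapse = "|"), ")\\s*$"
)

x <- c("1985", " 09 / 11 / 1985", "2. Jan. '16", "2 January 2016",
       "19855", "33-13")
hits <- grepl(combined, x, perl = TRUE)
data.frame(x, hits)
```

The first four strings match and the last two do not; adding a format then means adding one named entry to the vector.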

Please compare the following regex with yours to check the behavior on different input strings. I added word boundary \\b constraints; since you used str_extract_all, I assume there can be multiple dates in the same string.

library(stringr)

# Test string with a mix of plausible dates and near misses
string = "only a year 1985. No space 2.Jan.2016. 2. Jan. 2016. 2. Jan. '16 2/1/16 02/01/2016 19855 ID1985A 2. Jan 2016   2.. Jan 2016 1January2016 2-Jan.-2016 2-Jan-2016 2.\tJan.\t2016"
# Pattern from the question with the corrected first group
pattern = "(\\d{1,2}[/\\.-][ ]?)?(\\d{1,2}[ ]*[/\\.-]|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec|Jan\\.|Feb\\.|Febr\\.|Mar\\.|Apr\\.|Jun\\.|Jul\\.|Aug\\.|Sept\\.|Sep\\.|Oct\\.|Nov\\.|Dec\\.)[ ]*[']?\\d{2,4}"
# Chained formats with word boundaries
p="\\s*(?:\\b\\d{4}\\b)|(?:\\b\\d{1,2}\\s*[/\\.-]\\s*\\d{1,2}\\s*[/\\.-]\\s*(?:\\d{4}|\\d{2})\\b)|\\b\\d{1,2}\\s*[/\\.-]?\\s*(?:January|February|March|April|May|June|July|August|September|October|November|December|(?:Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec).?)\\s*[/\\.-]?\\s*(?:\\d{4}|'?\\d{2})\\b\\s*"
str_extract_all(string, pattern=pattern)
str_extract_all(string, pattern=p)

A word of warning: when allowing multiple versions of different formats with spaces, you allow for variation that makes it hard to guarantee that only dates are matched and not some other numeric values in the text.
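To illustrate the warning, here is a sketch with a made-up sentence in which a loose numeric pattern also picks up a version number and a spaced-out list of numbers. Base R with perl = TRUE is assumed; the text and the variable names are hypothetical.

```r
# Loose numeric date pattern with optional whitespace around the delimiters
loose <- "\\b\\d{1,2}\\s*[/.-]\\s*\\d{1,2}\\s*[/.-]\\s*(?:\\d{4}|\\d{2})\\b"

text <- "Released 02/01/2016 as version 1.2.16, build 3 - 4 - 12."

# All non-overlapping matches in the text
hits <- regmatches(text, gregexpr(loose, text, perl = TRUE))[[1]]
hits
```

Only the first of the three matches is meant as a date; the regex alone cannot tell them apart.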

Escaping the dot inside a character class is unnecessary: [\\.] can simply be [.], unless you also want to allow a backslash as a delimiter, as in day\\month\\year. When the input format is variable, a space can also be a tab (\\t), so replacing [ ] with \\s (which matches any whitespace character, including tabs and line terminators like \\n) seems to be a good idea.
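A short sketch of the difference between [ ] and \\s on tab-separated input, assuming base R with perl = TRUE and a pattern reduced to a single month name for brevity:

```r
x <- "2.\tJan.\t2016"   # day, month and year separated by tabs

space_only <- "\\d{1,2}[.][ ]*Jan[.]?[ ]*\\d{4}"   # literal spaces only
any_ws     <- "\\d{1,2}[.]\\s*Jan[.]?\\s*\\d{4}"   # any whitespace

tab_space_only <- grepl(space_only, x, perl = TRUE)
tab_any_ws     <- grepl(any_ws, x, perl = TRUE)

# \s also accepts line breaks between the parts
newline_any_ws <- grepl(any_ws, "2.\nJan.\n2016", perl = TRUE)

c(tab_space_only, tab_any_ws, newline_any_ws)  # FALSE TRUE TRUE
```

If matching across line breaks is not wanted, a class such as [ \\t] is the narrower choice.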
