
How to fix this date regular expression that is matching things it should not

I am trying to scan some documents to find dates for a classification problem. After reading around here and in some other places, I constructed the following regular expression:

import calendar
import re

months='['+'|'.join(calendar.month_abbr[1:])+'|'+'|'.join(calendar.month_name[1:])+']'
techPart='+\\.*\\s*\\d{1,2}[,]?[\\s*][1|2]\\d{3}'
dateExpr=months+techPart

I am testing it on this string:

newString='Mar. 31, 2011 Dec. 31, 2010 bananas Mar. 31, 2011 too much malarky September 1, 1992 redundant Dec. 31, 2010  September 29, 1999  March 12 2004 ddfd  March.    13 2019 ddfd  Mac.    13 2019 ddfd'

and when I run it like this:

for date in re.findall(dateExpr, newString):
    print(date)

I get this:

Mar. 31, 2011
Dec. 31, 2010
Mar. 31, 2011
September 1, 1992
Dec. 31, 2010
September 29, 1999
March 12 2004
March.    13 2019
Mac.    13 2019    #here is my problem

In your months regex, you are using square brackets, giving something like [Jan|Feb|Mar|...] . That is wrong. Square brackets define a character class, which matches any single character listed inside the brackets, so this will match J or a or n or | or F , etc. That is why Mac. slips through: M , a and c all occur somewhere in the month names. Instead you want to use parentheses:

months='(?:'+'|'.join(calendar.month_abbr[1:])+'|'+'|'.join(calendar.month_name[1:])+')'

You need the ?: because when a pattern contains a capturing group, findall returns only the text captured by that group instead of the whole match, so the group has to be marked as non-capturing.
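To see the difference, here is a minimal sketch that runs findall on a single date, once with a capturing group and once with a non-capturing one. The shortened two-month patterns and the variable names are only for illustration and are not part of the original code.

```python
import re

text = "Mar. 31, 2011"

# Capturing group: findall returns only what the group matched.
with_capture = re.findall(r"(Mar|Dec)\. \d{1,2}, \d{4}", text)

# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r"(?:Mar|Dec)\. \d{1,2}, \d{4}", text)

print(with_capture)     # ['Mar']
print(without_capture)  # ['Mar. 31, 2011']
```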

You have the same problem later in your regex where you do [1|2] . You want (?:1|2) , or just [12] .
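Putting both fixes together, one possible corrected version is sketched below. Two further changes are my own assumptions about the intent and not part of the answer above: the leading + is dropped (with a group it would allow repeated month names), and [\s*] is replaced by \s+ (the class matches a single whitespace character or a literal asterisk, which was probably not intended). The names tech_part, date_expr and dates are illustrative.

```python
import calendar
import re

new_string = ('Mar. 31, 2011 Dec. 31, 2010 bananas Mar. 31, 2011 too much malarky '
              'September 1, 1992 redundant Dec. 31, 2010  September 29, 1999  '
              'March 12 2004 ddfd  March.    13 2019 ddfd  Mac.    13 2019 ddfd')

# Non-capturing group of alternatives instead of a character class.
months = ('(?:' + '|'.join(calendar.month_abbr[1:]) + '|'
          + '|'.join(calendar.month_name[1:]) + ')')

# Optional dots, optional whitespace, 1-2 digit day, optional comma,
# whitespace (assumption: \s+ instead of [\s*]), then a year starting with 1 or 2.
tech_part = r'\.*\s*\d{1,2},?\s+[12]\d{3}'

date_expr = months + tech_part

dates = re.findall(date_expr, new_string)
for date in dates:
    print(date)
```

On the test string this should print the eight real dates and no longer report the Mac. entry. Note that calendar.month_abbr and calendar.month_name are locale-dependent, so the sketch assumes an English (or default C) locale.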
