
Regex | multi-line failure python

I have a few documents that I'm extracting dates from. My regex is:

query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril
    |[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary
    |[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept
    |[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""

or

query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|
    [mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|
    [nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|
    [oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""

The only difference between the two is that one has the |'s at the beginning of each new line and the other has them at the end of each line. The two match different things: with | at the end of the line I can't match May, but with | at the beginning of the line I can't match January (assuming the day, year, and spaces are otherwise correct). I literally just move the | around, and what I was matching before no longer matches, and vice versa. Am I doing something wrong, is there a way around this, or is there a correct way to do this instead? Obviously the goal is to match both. If you want to try it yourself, the cases I can easily reproduce are '8 may 2018' and '25 january 2018'.

The rest of my code is just re.search(query, doc), which is what fails to match.

Note: Python 3.6.8, regex==2018.1.10

As a few people have mentioned in the comments, you should try re.X (or re.VERBOSE).

This lets you spread the regex over multiple lines and include comments:

query = r"""
# Day
([0-9]{1,2})?
\s{1,2}
# Long month (the misspellings from the question are kept as extra alternatives)
([jJ]anuary|[jJ]anurary|[fF]ebruary|[fF]eburary|[mM]arch
|[aA]pril|[mM]ay|[jJ]une
|[jJ]uly|[aA]ugust|[sS]eptember
|[oO]ctober|[nN]ovember|[dD]ecember
# Short month
|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug
|[sS]ept?|[oO]ct|[nN]ov|[dD]ec)
\s{1,2}
# Year
([0-9]{2,4})"""

This is useful for splitting your regex into more manageable pieces and documenting each one.

Also, you probably want to compile your regex if you use it more than once: pattern = re.compile(query, re.X) or pattern = re.compile(query, re.VERBOSE).
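To make the compile step concrete, here is a minimal sketch of a verbose pattern compiled once and reused. It assumes only correctly spelled month names are needed, and the helper name find_date is made up for illustration.

```python
import re

# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored.
# Assumption: only correctly spelled month names are needed.
query = r"""
    ([0-9]{1,2})?      # optional day
    \s{1,2}
    (                  # month: long names first, then abbreviations
      [jJ]anuary|[fF]ebruary|[mM]arch|[aA]pril|[mM]ay|[jJ]une
     |[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[nN]ovember|[dD]ecember
     |[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ept?|[oO]ct|[nN]ov|[dD]ec
    )
    \s{1,2}
    ([0-9]{2,4})       # year
"""

# Compile once, reuse for every document.
pattern = re.compile(query, re.VERBOSE)


def find_date(doc):
    """Return (day, month, year) for the first date in doc, or None."""
    match = pattern.search(doc)
    return match.groups() if match else None


print(find_date("8 may 2018"))
print(find_date("25 january 2018"))
```

Both test strings from the question should match with this pattern, because the line breaks and indentation are no longer part of the regex.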

When you enter a string with triple quotes, every character between the quotes is kept, including \n and the indentation at the start of each continued line. This is what your query string really looks like:

>>> query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|
... [mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|
... [nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|
... [oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
>>> query
'([0-9]{1,2})?\\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|\n    [mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|\n    [nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|\n    [oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\\s{1,2}([0-9]{2,4})'
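The effect can be reproduced with a cut-down version of the pattern. The sketch below uses only three months (a simplification, not the full pattern from the question), and the helper name matches is made up. The alternative next to the line break silently picks up the newline and the four spaces of indentation.

```python
import re

# Pipe at the end of the line: the second alternative is "\n    [mM]ay".
pipe_at_end = r"""([0-9]{1,2})?\s{1,2}([aA]pril|
    [mM]ay|[jJ]une)\s{1,2}([0-9]{2,4})"""

# Pipe at the start of the line: the first alternative is "[aA]pril\n    ".
pipe_at_start = r"""([0-9]{1,2})?\s{1,2}([aA]pril
    |[mM]ay|[jJ]une)\s{1,2}([0-9]{2,4})"""


def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


print(matches(pipe_at_end, "8 may 2018"))      # May sits right after the line break
print(matches(pipe_at_start, "8 april 2018"))  # April sits right before the line break
print(matches(pipe_at_start, "8 may 2018"))
```

Whichever month is adjacent to the line break stops matching ordinary text, which is why moving the | swaps the failing case between May and January in the question.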

Avoid this by using \ line continuation to enter the string on multiple lines:

query = r"([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|" \
        r"[sS]eptember|[oO]ctober|[jJ]anuary|[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|" \
        r"[sS]ept|[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"
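A variant of the same idea, shown as a sketch rather than the answer's exact code: adjacent string literals inside parentheses are also concatenated, so the trailing backslashes are not needed. It assumes only correctly spelled month names are wanted.

```python
import re

# Adjacent string literals inside parentheses are joined into one string,
# so no newline or indentation ends up in the pattern.
# Assumption: only correctly spelled month names are needed.
query = (
    r"([0-9]{1,2})?\s{1,2}"
    r"([jJ]anuary|[fF]ebruary|[mM]arch|[aA]pril|[mM]ay|[jJ]une|[jJ]uly|"
    r"[aA]ugust|[sS]eptember|[oO]ctober|[nN]ovember|[dD]ecember|"
    r"[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ept?|[oO]ct|[nN]ov|[dD]ec)"
    r"\s{1,2}([0-9]{2,4})"
)

for text in ("8 may 2018", "25 january 2018"):
    match = re.search(query, text)
    print(text, "->", match.groups() if match else None)
```

Each piece can be indented freely here, because the indentation sits outside the string literals.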

You can also keep your triple quotes and suppress the newline with \ (remember that you can't indent the lines below the first, because those spaces/tabs would be included in the string):

query = """([0-9]{1,2})?\s{1,2}([jJ]anurary|[fF]eburary|[mM]arch|[aA]pril|\
[mM]ay|[jJ]une|[jJ]uly|[aA]ugust|[sS]eptember|[oO]ctober|[jJ]anuary|\
[nN]ovember|[dD]ecember|[jJ]an|[fF]eb|[mM]ar|[aA]pr|[aA]ug|[sS]ep|[sS]ept|\
[oO]ct|[nN]ov|[dD]ec|[fF]ebruary)\s{1,2}([0-9]{2,4})"""
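The indentation caveat is easy to check on a tiny pattern. This sketch uses two made-up variable names, joined and indented, and a two-month pattern chosen only for brevity.

```python
import re

# A backslash at the end of the line removes the newline from the literal.
joined = """[mM]ay|\
[jJ]une"""

# Indenting the continued line leaks four spaces into the second alternative.
indented = """[mM]ay|\
    [jJ]une"""

print(repr(joined))
print(repr(indented))
print(re.fullmatch(joined, "june") is not None)
print(re.fullmatch(indented, "june") is not None)
```

Note that this only works in a non-raw string: in a raw triple-quoted string the backslash and the newline are both kept.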

See also: Pythonic way to create a long multi-line string

A long single-line regex is hard to read, so splitting it across several lines is worth doing. The link Ian mentioned, "Pythonic way to create a long multi-line string", covers how to write multi-line strings in Python.
