How to write regex in Python with (?(DEFINE))?

I would like to parse codetags in source files. I wrote this regex, which works fine with PCRE:

(?<tag>(?&TAG)):\s*
(?<message>.*?)
(
<
   (?<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
   (?<date>(?&DATE))?
   (?<flags>(?&FLAGS))?
>
)?
$

(?(DEFINE)
   (?<TAG>\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG))
   (?<DATE>\d{4}-\d{2}-\d{2})
   (?<FLAGS>[pts]:\w+\b)
)

Unfortunately, it seems Python doesn't understand the DEFINE syntax (https://regex101.com/r/qH1uG3/1#pcre).
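To see the failure concretely, here is a minimal sketch using only the standard-library re module. It compiles three reduced fragments (made up for illustration, not the full pattern above), one per PCRE construct the pattern relies on, and prints whatever error re reports for each. The names FRAGMENTS and try_compile are hypothetical.

```python
import re

# Reduced fragments (illustrative only), one per PCRE construct used above.
FRAGMENTS = {
    "PCRE-style named group": r"(?<tag>\w+)",
    "subroutine call": r"(?&TAG)",
    "DEFINE block": r"(?(DEFINE)(?P<TAG>\w+))",
}


def try_compile(pattern):
    """Return None if `re` accepts the pattern, otherwise the error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


for label, fragment in FRAGMENTS.items():
    print(f"{label}: {try_compile(fragment)}")
```

All three fragments are rejected, so removing the DEFINE block alone is not enough: the named groups and the (?&NAME) calls need attention as well.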

What is the best workaround in Python?

Using the regex module:

As explained in the comments, the regex module allows you to reuse named subpatterns. Unfortunately, when this answer was written it had no (?(DEFINE)...) syntax like Perl or PCRE (later releases of the module document support for it).

So the way to go is the same workaround used in Ruby: put a {0} quantifier on a named group when you only want to define a subpattern, not match it at that position:

import regex

s = r'''
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
'''

p = r'''
    # subpattern definitions
    (?<TAG> \b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG) ){0}
    (?<DATE> \d{4}-\d{2}-\d{2} ){0}
    (?<FLAGS> [pts]:\w+ ){0}

    # main pattern
    (?<tag> (?&TAG) ) : \s*
    (?<message> (?>[^\s<]+[^\n\S]+)* [^\s<]+ )? \s* # to trim the message
    (?: # the <...> metadata block is optional, as in the question's pattern
        <
        (?<author> (?: \w{3} \s* , \s* )*+ \w{3} )? \s*
        (?<date> (?&DATE) )?
        (?<flags> (?&FLAGS) )?
        >
    )?
    $
'''

rgx = regex.compile(p, regex.VERBOSE | regex.MULTILINE)

for m in rgx.finditer(s):
    print(m.group('tag'))

Note: the subpatterns can be defined at the end of the pattern too.

(?P<tag>\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)):\s*
(?P<message>.*?)
(
<
   (?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
   (?P<date>\d{4}-\d{2}-\d{2})?
   (?P<flags>[pts]:\w+\b)?
>
)?
$

As a workaround, you can just inline the subpattern definitions where they are used, as in the pattern above. See the demo: https://regex101.com/r/qH1uG3/2
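As a quick check that the inlined pattern is accepted by the standard-library re module, here is a small sketch. The sample text is borrowed from the regex-module answer, and the flags re.VERBOSE | re.MULTILINE are an assumption (the pattern is laid out over several lines and uses $ per line); the names SAMPLE and CODETAG are hypothetical.

```python
import re

SAMPLE = """\
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
"""

# The inlined pattern, unchanged; whitespace is ignored under re.VERBOSE.
CODETAG = re.compile(r"""
(?P<tag>\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)):\s*
(?P<message>.*?)
(
<
   (?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
   (?P<date>\d{4}-\d{2}-\d{2})?
   (?P<flags>[pts]:\w+\b)?
>
)?
$
""", re.VERBOSE | re.MULTILINE)

for m in CODETAG.finditer(SAMPLE):
    # The lazy message group keeps the space before "<", so strip it here.
    print(m.group("tag"), repr(m.group("message").strip()),
          m.group("author"), m.group("date"), m.group("flags"))
```

Groups that did not take part in a match (for example the author on the NOTE line) come back as None.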

As a quick fix, place your definitions in a dict:

defines = {
    'TAG': r'\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)',
    'DATE': r'\d{4}-\d{2}-\d{2}',
    'FLAGS': r'[pts]:\w+\b'
}

and substitute them into your regex:

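# Wrap each definition in (?:...) so that a quantifier following (?&NAME)
# applies to the whole definition, not just to its last element.
regex = re.sub(r'\(\?&(\w+)\)', lambda m: '(?:' + defines[m.group(1)] + ')', regex)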

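If you have recursive definitions (definitions that refer to other definitions), wrap that in a loop:

# Note: a definition that refers to itself would make this loop run forever.
define = r'\(\?&(\w+)\)'
while re.search(define, regex):
    regex = re.sub(define, lambda m: '(?:' + defines[m.group(1)] + ')', regex)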
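Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes the standard-library re module and the dict approach above, and it adds two things the snippets leave out: Python's re spells named groups (?P<name>...) rather than (?<name>...), so those are rewritten too, and the loop is capped so that a definition referring to itself raises an error instead of looping forever. The names expand, CALL, NAMED_GROUP, and max_passes are hypothetical.

```python
import re

DEFINES = {
    "TAG": r"\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)",
    "DATE": r"\d{4}-\d{2}-\d{2}",
    "FLAGS": r"[pts]:\w+\b",
}

# Main pattern from the question, with the (?(DEFINE)...) block removed.
MAIN_PATTERN = r"""
(?<tag>(?&TAG)):\s*
(?<message>.*?)
(
<
   (?<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
   (?<date>(?&DATE))?
   (?<flags>(?&FLAGS))?
>
)?
$
"""

CALL = re.compile(r"\(\?&(\w+)\)")        # matches (?&NAME)
NAMED_GROUP = re.compile(r"\(\?<(\w+)>")  # matches (?<name>, but not (?<= or (?<!


def expand(pattern, defines, max_passes=20):
    """Inline (?&NAME) calls, then convert (?<name>...) to (?P<name>...)."""
    passes = 0
    while CALL.search(pattern):
        if passes == max_passes:
            raise ValueError("definitions nest too deeply or refer to themselves")
        # Wrap in (?:...) so a following quantifier covers the whole definition.
        pattern = CALL.sub(lambda m: "(?:" + defines[m.group(1)] + ")", pattern)
        passes += 1
    return NAMED_GROUP.sub(lambda m: "(?P<" + m.group(1) + ">", pattern)


codetag = re.compile(expand(MAIN_PATTERN, DEFINES), re.VERBOSE | re.MULTILINE)

sample = "// NOTE: A small example\n// HACK: Another example <ABC,DEF p:0>\n"
for m in codetag.finditer(sample):
    print(m.group("tag"), m.group("author"), m.group("date"), m.group("flags"))
```

Textual substitution like this cannot express genuinely recursive patterns; for those, an engine with real subroutine calls (PCRE or the regex module) is still needed.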

A not-so-quick fix is to write your own re parser-compiler, but that's almost definitely overkill for the task at hand.
