How to write regex in Python with (?(DEFINE))?
I would like to parse codetags in source files. I wrote this regex, which works fine with PCRE:
(?<tag>(?&TAG)):\s*
(?<message>.*?)
(
<
(?<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
(?<date>(?&DATE))?
(?<flags>(?&FLAGS))?
>
)?
$
(?(DEFINE)
(?<TAG>\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG))
(?<DATE>\d{4}-\d{2}-\d{2})
(?<FLAGS>[pts]:\w+\b)
)
Unfortunately, it seems Python doesn't understand DEFINE (https://regex101.com/r/qH1uG3/1#pcre).
What is the best workaround in Python?
The way with the regex module:
As explained in the comments, the regex module lets you reuse named subpatterns. At the time this answer was written, it had no (?(DEFINE)...) syntax like Perl or PCRE (newer releases of regex have since added it).
So the way is the same workaround as in Ruby: put a {0} quantifier on a named group when you only want to define a subpattern:
import regex
s = r'''
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
'''
p = r'''
# subpattern definitions
(?<TAG> \b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG) ){0}
(?<DATE> \d{4}-\d{2}-\d{2} ){0}
(?<FLAGS> [pts]:\w+ ){0}
# main pattern
(?<tag> (?&TAG) ) : \s*
(?<message> (?>[^\s<]+[^\n\S]+)* [^\s<]+ )? \s* # to trim the message
<
(?<author> (?: \w{3} \s* , \s* )*+ \w{3} )? \s*
(?<date> (?&DATE) )?
(?<flags> (?&FLAGS) )?
>
$
'''
rgx = regex.compile(p, regex.VERBOSE | regex.MULTILINE)
for m in rgx.finditer(s):
    print(m.group('tag'))
Note: the subpatterns can also be defined at the end of the pattern.
(?P<tag>\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)):\s*
(?P<message>.*?)
(
<
(?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*
(?P<date>\d{4}-\d{2}-\d{2})?
(?P<flags>[pts]:\w+\b)?
>
)?
$
As a workaround, you can simply substitute the definitions in place, as in the pattern above, which uses the (?P<name>...) group syntax that Python's re module understands. See the demo:
https://regex101.com/r/qH1uG3/2
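To try the inlined pattern with the standard-library re module, a minimal sketch could look like the following. The name CODETAG and the sample comment lines are assumptions made for illustration; the pattern itself is the one shown above, split across string literals for readability.

```python
import re

# The inlined pattern from above, compiled with the standard-library re module.
# re.MULTILINE lets $ match at the end of every line, not only at the end of the text.
CODETAG = re.compile(
    r'(?P<tag>\b(?:NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)):\s*'
    r'(?P<message>.*?)'
    r'('
    r'<'
    r'(?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*'
    r'(?P<date>\d{4}-\d{2}-\d{2})?'
    r'(?P<flags>[pts]:\w+\b)?'
    r'>'
    r')?'
    r'$',
    re.MULTILINE,
)

# Hypothetical sample input, one codetag per line.
sample = """\
// NOTE: A small example
// HACK: Another example <ABC 2014-02-03>
// HACK: Another example <ABC,DEF p:0>
"""

# One dict of named groups per codetag; groups that did not take part are None.
matches = [m.groupdict() for m in CODETAG.finditer(sample)]
for fields in matches:
    print(fields)
```

Because the message group is lazy, it keeps any whitespace that precedes the "<", so you may want to call .strip() on it.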
As a quick fix, put your definitions in a dict:
defines = {
    'TAG': r'\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)',
    'DATE': r'\d{4}-\d{2}-\d{2}',
    'FLAGS': r'[pts]:\w+\b'
}
and replace them in your regex:
regex = re.sub(r'\(\?&(\w+)\)', lambda m: defines[m.group(1)], regex)
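For the result to compile with re, the main pattern also has to drop the (?(DEFINE)...) block and spell its named groups as (?P<name>...). Here is a rough end-to-end sketch under those assumptions. The names template, expanded and codetag, and the sample line, are made up for illustration, and wrapping each definition in (?:...) is an extra precaution that the one-liner above does not take.

```python
import re

# Definitions that would otherwise live in the (?(DEFINE)...) block.
defines = {
    'TAG': r'\b(NOTE|LEGACY|HACK|TODO|FIXME|XXX|BUG)',
    'DATE': r'\d{4}-\d{2}-\d{2}',
    'FLAGS': r'[pts]:\w+\b',
}

# Main pattern only: no DEFINE block, and groups written as (?P<name>...).
template = (
    r'(?P<tag>(?&TAG)):\s*'
    r'(?P<message>.*?)'
    r'(<'
    r'(?P<author>(?:\w{3}\s*,\s*)*\w{3})?\s*'
    r'(?P<date>(?&DATE))?'
    r'(?P<flags>(?&FLAGS))?'
    r'>)?$'
)

# Replace each (?&NAME) call with its definition. The (?:...) wrapper keeps a
# definition's alternation or a following quantifier from leaking into its context.
expanded = re.sub(
    r'\(\?&(\w+)\)',
    lambda m: '(?:' + defines[m.group(1)] + ')',
    template,
)
codetag = re.compile(expanded, re.MULTILINE)

m = codetag.search('// FIXME: handle empty input <ABC 2014-02-03>')
print(m.group('tag'), m.group('date'))  # FIXME 2014-02-03
```

Passing a function rather than a replacement string to re.sub matters here: the function's return value is inserted literally, so the backslashes in the definitions are not treated as escapes.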
If your definitions refer to other definitions, wrap that in a loop (it only terminates if no definition ends up referring to itself):
define = r'\(\?&(\w+)\)'
while re.search(define, regex):
    regex = re.sub(define, lambda m: defines[m.group(1)], regex)
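A variation on the loop that gives up instead of spinning forever on a self-referential definition might look like this. The helper expand, its max_passes parameter, and the DATE/TIME/STAMP definitions are hypothetical and only illustrate the idea.

```python
import re

CALL = re.compile(r'\(\?&(\w+)\)')

def expand(pattern, defines, max_passes=20):
    """Inline (?&NAME) calls repeatedly; fail if calls remain after max_passes."""
    for _ in range(max_passes):
        if not CALL.search(pattern):
            return pattern
        pattern = CALL.sub(lambda m: '(?:' + defines[m.group(1)] + ')', pattern)
    if CALL.search(pattern):
        raise ValueError('definitions appear to be recursive')
    return pattern

# Hypothetical nested definitions: STAMP is built from DATE and TIME.
defines = {
    'DATE': r'\d{4}-\d{2}-\d{2}',
    'TIME': r'\d{2}:\d{2}',
    'STAMP': r'(?&DATE)T(?&TIME)',
}

stamp = re.compile(expand(r'(?P<stamp>(?&STAMP))', defines))
print(stamp.search('built 2014-02-03T09:30 ok').group('stamp'))  # 2014-02-03T09:30
```

Textual substitution can only handle definitions that eventually bottom out; genuinely recursive patterns need an engine that supports them, such as the regex module.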
A not-so-quick fix is to write your own re parser-compiler, but that is almost certainly overkill for the task at hand.