Use regex to find shortest possible match from multiline string
I have a multiline string with the following structure:
text1 (arbitrary chars and lines)\n
<hr>\n
Bitmap: ./media/logo.bmp\n
text2 (arbitrary chars and lines)\n
text3 (arbitrary chars and lines)\n
<hr>\n
Bitmap: ./media/logo.bmp\n
text2 (arbitrary chars lines)\n
\n
I want to match this substring, which always appears twice in the string (one occurrence is always at the end):
<hr>\n
Bitmap: ./media/logo.bmp\n
text2 (arbitrary chars and lines)\n
When I try to match it with re.search, it returns the long match:
import re

# Raw strings keep \S and \n as regex escapes instead of Python string escapes.
regex = re.compile(r'<hr>\n'
                   r'Bitmap: [\S\n ]*'
                   r'$')
print(re.search(regex, string).group())
>> '<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars and lines)\ntext3 (arbitrary chars and lines)\n<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars and lines)\n\n'
Is it possible to use a regex to find the short match?
Solution:
Using a lookahead with the OR operator returns both matches (one longer, one shorter):
import re

# The lazy [\s\S]*? stops at the next <hr> or at the final newline of the string.
regex = re.compile(r'<hr>\n'
                   r'Bitmap: [\S]*\n'
                   r'[\s\S]*?(?=<hr>|\n\Z)')
print(re.findall(regex, string))
>> ['<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars and lines)\ntext3 (arbitrary chars and lines)\n', '<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars lines)\n']
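To check this against the sample from the question, here is a minimal, self-contained sketch. The names `text`, `pattern`, `matches`, and `shortest` are illustrative and not from the original post, and the sample string is rebuilt from the layout shown in the question. Because the lookahead splits the input into one block per `<hr>`, the short match can then be picked with `min(..., key=len)`; in this sample it is also the last match.

```python
import re

# Sample input rebuilt from the question (an assumption about its exact content).
text = (
    "text1 (arbitrary chars and lines)\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2 (arbitrary chars and lines)\n"
    "text3 (arbitrary chars and lines)\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2 (arbitrary chars lines)\n"
    "\n"
)

# Same pattern as in the answer, written as one raw string.
pattern = re.compile(r"<hr>\nBitmap: [\S]*\n[\s\S]*?(?=<hr>|\n\Z)")

# findall returns every non-overlapping match, left to right.
matches = pattern.findall(text)

# Pick the shortest block out of all matches.
shortest = min(matches, key=len)
print(repr(shortest))
```

If the wanted block is always the final one, `matches[-1]` would work as well and avoids comparing lengths.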
Use
(?m)^<hr>\r?\nBitmap:[\s\S]*?(?=^<hr>$|\Z)
See proof.
Explanation
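A short sketch of how this pattern might be applied in Python follows. The names `text`, `pattern`, and `blocks` are hypothetical, and the sample string is rebuilt from the question. One thing to keep in mind: in Python's `re` module `\Z` matches only at the very end of the string, so the last block keeps the trailing blank line; stripping it afterwards is one option.

```python
import re

# Sample input rebuilt from the question (an assumption about its exact content).
text = (
    "text1 (arbitrary chars and lines)\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2 (arbitrary chars and lines)\n"
    "text3 (arbitrary chars and lines)\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2 (arbitrary chars lines)\n"
    "\n"
)

# (?m) turns on multiline mode so ^ and $ match at line boundaries.
pattern = re.compile(r"(?m)^<hr>\r?\nBitmap:[\s\S]*?(?=^<hr>$|\Z)")

# One block per <hr> line; each block stops right before the next <hr> line
# or at the end of the string.
blocks = pattern.findall(text)

# The final block, with the trailing blank line reduced to a single newline.
last_block = blocks[-1].rstrip("\n") + "\n"
print(repr(last_block))
```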
--------------------------------------------------------------------------------
  (?m)                     set flags for this block (with ^ and $
                           matching start and end of line)
                           (case-sensitive) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  <hr>                     '<hr>'
--------------------------------------------------------------------------------
  \r?                      '\r' (carriage return) (optional (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
--------------------------------------------------------------------------------
  Bitmap:                  'Bitmap:'
--------------------------------------------------------------------------------
  [\s\S]*?                 any character of: whitespace (\n, \r, \t,
                           \f, and " "), non-whitespace (all but \n,
                           \r, \t, \f, and " ") (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    ^                        the beginning of a "line"
--------------------------------------------------------------------------------
    <hr>                     '<hr>'
--------------------------------------------------------------------------------
    $                        before an optional \n, and the end of a
                             "line"
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \Z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
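The same breakdown can be kept next to the pattern itself by using Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. The sketch below is one possible way to write it; the names `text`, `compact`, and `annotated` are illustrative only, and the sample string is rebuilt from the question.

```python
import re

# Sample input rebuilt from the question (an assumption about its exact content).
text = (
    "text1 (arbitrary chars and lines)\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2 (arbitrary chars and lines)\n"
    "text3 (arbitrary chars and lines)\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2 (arbitrary chars lines)\n"
    "\n"
)

# Compact form, as given in the answer.
compact = re.compile(r"(?m)^<hr>\r?\nBitmap:[\s\S]*?(?=^<hr>$|\Z)")

# Annotated form: identical pattern, with flags passed explicitly.
annotated = re.compile(
    r"""
    ^<hr>          # '<hr>' at the beginning of a line
    \r?\n          # optional carriage return, then a newline
    Bitmap:        # the literal text 'Bitmap:'
    [\s\S]*?       # any characters, as few as possible
    (?=            # look ahead, without consuming, for either:
        ^<hr>$     #   a line that consists only of '<hr>'
      |            #   OR
        \Z         #   the end of the string
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Both forms should find the same blocks.
print(annotated.findall(text) == compact.findall(text))
```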