Find the shortest match in a multi-line string using a regular expression

Question

I have a multi-line string with the following layout:

text1 (arbitrary chars and lines)\n
<hr>\n
Bitmap: ./media/logo.bmp\n
text2 (arbitrary chars and lines)\n
text3 (arbitrary chars and lines)\n
<hr>\n
Bitmap: ./media/logo.bmp\n
text2 (arbitrary chars lines)\n
\n

I want to match this substring, which always occurs twice in the string (one occurrence is always at the end):

<hr>\n
Bitmap: ./media/logo.bmp\n
text2 (arbitrary chars and lines)\n

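When I try to match it with re.search, it returns the long match:

import re

# string holds the multi-line text shown above.
# Raw strings keep \S and \n as regex escapes rather than Python escapes.
regex = re.compile(r'<hr>\n'
                   r'Bitmap: [\S\n ]*'
                   r'$')
print(regex.search(string).group())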

>> '<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars and lines)\ntext3 (arbitrary chars and lines)\n<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars and lines)\n\n'

Is it possible to use a regex to find the short match?

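Solution:
A lazy quantifier followed by a lookahead with the OR operator returns two matches (one longer, one shorter):

# The lookahead stops each match at the next <hr> or at the final newline.
regex = re.compile(r'<hr>\n'
                   r'Bitmap: \S*\n'
                   r'[\s\S]*?(?=<hr>|\n\Z)')
print(regex.findall(string))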
>> ['<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars and lines)\ntext3 (arbitrary chars and lines)\n', '<hr>\nBitmap: ./media/logo.bmp\ntext2 (arbitrary chars lines)\n']
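The snippet above depends on a `string` variable that the question never defines. The following is a minimal self-contained sketch of the same lookahead approach. The sample `text` is a shortened, hypothetical stand-in for the layout described in the question, and picking the shortest result with `min` is an illustrative addition rather than part of the original solution.

```python
import re

# Hypothetical sample input that follows the layout from the question.
text = (
    "text1\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2\n"
    "text3\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2\n"
    "\n"
)

# Lazy body, bounded by the next <hr> or by the final newline of the string.
pattern = re.compile(
    r"<hr>\n"
    r"Bitmap: \S*\n"
    r"[\s\S]*?(?=<hr>|\n\Z)"
)

matches = pattern.findall(text)

# The block of interest is the shorter one, which here is also the last one.
shortest = min(matches, key=len)
print(repr(shortest))
```

Since the wanted block is always at the end of the string, `matches[-1]` would select the same element here.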

Answer 1

Use

(?m)^<hr>\r?\nBitmap:[\s\S]*?(?=^<hr>$|\Z)

See proof.

Explanation

--------------------------------------------------------------------------------
  (?m)                     set flags for this block (with ^ and $
                           matching start and end of line) (case-
                           sensitive) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  <hr>                     '<hr>'
--------------------------------------------------------------------------------
  \r?                      '\r' (carriage return) (optional (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
--------------------------------------------------------------------------------
  Bitmap:                  'Bitmap:'
--------------------------------------------------------------------------------
  [\s\S]*?                 any character of: whitespace (\n, \r, \t,
                           \f, and " "), non-whitespace (all but \n,
                           \r, \t, \f, and " ") (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    ^                        the beginning of a "line"
--------------------------------------------------------------------------------
    <hr>                     '<hr>'
--------------------------------------------------------------------------------
    $                        before an optional \n, and the end of a
                             "line"
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \Z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of look-ahead
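As a rough check of the pattern explained above, it can be run from Python with `re.findall`. This is a sketch under assumptions: the sample `text` is a hypothetical, shortened version of the input described in the question, with plain `\n` line endings.

```python
import re

# Hypothetical sample input that follows the layout from the question.
text = (
    "text1\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2\n"
    "text3\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2\n"
    "\n"
)

# (?m) makes ^ and $ match at line boundaries; \Z matches only at the very end.
pattern = re.compile(r"(?m)^<hr>\r?\nBitmap:[\s\S]*?(?=^<hr>$|\Z)")

blocks = pattern.findall(text)

# The last block is the one at the end of the string.
last_block = blocks[-1]
print(repr(last_block))
```

Because the lookahead falls back to `\Z`, the last block keeps the trailing blank line of the input; calling `rstrip("\n")` on it would drop that if it is unwanted.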

Answer 2

This works: <hr>\\nBitmap:.*\\n(?:.*\\n){1,2}

See: https://regex101.com/r/i64K0W/3

The problem in your regex is the *, which is greedy.
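The greediness point can be checked directly. The sketch below compares the greedy pattern from the question with a lazy variant bounded by a lookahead. The sample `text` is a hypothetical, shortened input, and the lazy variant is an illustration of the idea rather than the exact pattern from this answer.

```python
import re

# Hypothetical sample input that follows the layout from the question.
text = (
    "text1\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2\n"
    "text3\n"
    "<hr>\n"
    "Bitmap: ./media/logo.bmp\n"
    "text2\n"
    "\n"
)

# Greedy: the character class also matches '<', 'h', 'r', '>' and newlines,
# so * runs past the second <hr> all the way to the end of the string.
greedy = re.search(r"<hr>\nBitmap: [\S\n ]*$", text).group()

# Lazy: *? grows one character at a time and stops as soon as the lookahead
# sees the next <hr> or the final newline.
lazy_pattern = re.compile(r"<hr>\nBitmap: [\S\n ]*?(?=<hr>|\n\Z)")
lazy_blocks = lazy_pattern.findall(text)

print(repr(greedy))
print(lazy_blocks)
```

The greedy search returns everything from the first `<hr>` to the end of the string, while the lazy pattern yields the two blocks separately.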

Answer 1: score 2, accepted, 2020-10-14 20:48:17

Answer 2: score 0, 2020-10-14 13:44:32
