Using regex to find-and-replace an arbitrary number of elements per match

My goal is to recognize parenthesized text inside bold blocks of a markup language, e.g.:

[B] blah blah (foo) blah [/B]

and use regex to surround it with another tag, like so:

[B] blah blah [C](foo)[/C] blah [/B]

Here's my attempt at this using Python:

outtext = re.sub(r'(\[B\].*?)(\(.*?\))(.*?\[/B\])', r'\1[C]\2[/C]\3', intext)

The problem is that it doesn't work if there are multiple parenthesized strings within the block:

Input: [B] (foo) (bar) [/B]
Expected: [B] [C](foo)[/C] [C](bar)[/C] [/B]
Actual: [B] [C](foo)[/C] (bar) [/B]

I know why this is happening, but I don't know how to fix it. Is it possible to change my regex so that it finds and replaces an arbitrary number of parenthesized strings within each block, rather than just one?
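For readers following along, the cause can be made visible by inspecting what the original pattern captures. This is only an illustrative sketch; the variable names are not from the question. A single match consumes the whole block, so only the first parenthesized string ever lands in group 2:

```python
import re

# The pattern from the question, unchanged.
pattern = r'(\[B\].*?)(\(.*?\))(.*?\[/B\])'
intext = '[B] (foo) (bar) [/B]'

match = re.search(pattern, intext)
groups = match.groups()
print(groups)

# re.sub only sees non-overlapping matches, and there is just one here,
# so "(bar)" ends up inside group 3 and is never wrapped.
match_count = len(re.findall(pattern, intext))
print(match_count)
```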

At first I thought regex alone was not capable of solving the problem. JvdV proved this wrong, well done. Honestly, though, I no longer understand that regex.

I solved it with a simpler regex and a bit of Python:

import re

intext = '[B] (foo) (bar) [/B] (not) [B] (this again) [/B]'

boldParts = re.findall(r'\[B\].*?\[/B\]', intext)
outtext = intext
for part in boldParts:
    replacement = re.sub(r'(\(.*?\))', r'[C]\1[/C]', part)
    outtext = outtext.replace(part, replacement)

print(outtext)

First I look for only the bold parts in intext; then it's easy to wrap the parenthesized strings inside each one. Finally, each modified part is substituted back into outtext.

Admittedly not the shortest or most elegant way of doing it, but maybe a little more readable.

This kind of problem is typically solved by replacing matches only inside other matches. You need to run re.sub with a regex that matches all B-tagged substrings and, using a callable as the replacement argument, replace every parenthesized string inside those matches only.

Here is the solution:

import re
text = "[B] blah blah (foo) blah [/B]\n[B] (foo) (bar) [/B]"
print(re.sub(r'(?s)\[B].*?\[/B]', lambda x: re.sub(r'\([^()]*\)', r'[C]\g<0>[/C]', x.group()), text))

See the Python demo.

NOTE: If you have longer texts, unroll the lazy dot pattern and use

r'\[B][^[]*(?:\[(?!/?B])[^[]*)*\[/B]'

See this regex demo.

Output:

[B] blah blah [C](foo)[/C] blah [/B]
[B] [C](foo)[/C] [C](bar)[/C] [/B]
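As a sketch of how the unrolled pattern from the note could be dropped in, the snippet below runs both variants on the same sample text and compares the results. The constant names LAZY and UNROLLED are introduced here purely for illustration.

```python
import re

# Both block patterns are taken from the answer; only the names are new.
LAZY = r'(?s)\[B].*?\[/B]'
UNROLLED = r'\[B][^[]*(?:\[(?!/?B])[^[]*)*\[/B]'

def wrap_parens(match):
    # Wrap every parenthesized run found inside one [B]...[/B] block.
    return re.sub(r'\([^()]*\)', r'[C]\g<0>[/C]', match.group())

text = "[B] blah blah (foo) blah [/B]\n[B] (foo) (bar) [/B]"
lazy_result = re.sub(LAZY, wrap_parens, text)
unrolled_result = re.sub(UNROLLED, wrap_parens, text)

print(unrolled_result)
print(lazy_result == unrolled_result)
```

On this input the two agree. They can differ on other inputs: the lookahead in the unrolled pattern refuses to cross a nested opening [B], whereas the lazy pattern does not.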

The (?s)\[B].*?\[/B] pattern matches [B], then zero or more characters, as few as possible, up to the leftmost [/B] (note that (?s) allows . to match any character, including line breaks). Once a match is found, it is passed to the callable, and the \([^()]*\) regex is run on that match. \([^()]*\) matches any substring between the closest pair of parentheses, i.e. (, then zero or more characters other than ( and ), and then ). The \g<0> in the replacement pattern is a backreference to the whole match.
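To make the two-level substitution easier to follow, here is a minimal sketch that replaces the lambda with a named function and records what the callable receives. The names wrap_in_block and seen_blocks are hypothetical, and the sample text is borrowed from the earlier answer so that a parenthesized string outside any block is included.

```python
import re

seen_blocks = []  # records each [B]...[/B] substring handed to the callable

def wrap_in_block(match):
    block = match.group()  # the whole matched block, tags included
    seen_blocks.append(block)
    # The inner substitution only ever sees this one block.
    return re.sub(r'\([^()]*\)', r'[C]\g<0>[/C]', block)

text = '[B] (foo) (bar) [/B] (not) [B] (this again) [/B]'
result = re.sub(r'(?s)\[B].*?\[/B]', wrap_in_block, text)

print(seen_blocks)
print(result)
```

Because "(not)" is never part of an outer match, it is never passed to the inner re.sub and stays unwrapped.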

Okay, this took me some time. I'm not sure of the specifics of the markup syntax, so I'll make some assumptions: the text inside parentheses can be any character except a parenthesis, unless the parenthesis is escaped. The escape character is a backslash. With that said, here's what I came up with.

>>> expr = r"""
...    \(                            # Match left paren.
...       (
...         (?:      [^ \( \) \\] |  # Match any char not a paren or escape, OR
...              \\  [  \( \) \\] |  # Match an escaped paren or escape, OR
...              \s                  # whitespace.
...          )*
...        )
...    \)                            # Match right paren. """
...
>>> re.sub(expr, r"[C](\1)[/C]",  "[B] (foo) (bar) [/B]", flags=re.VERBOSE)
'[B] [C](foo)[/C] [C](bar)[/C] [/B]'

This will also work with target strings that have escaped parentheses in them. The abbreviated form of the above is this:

re.sub(r"\(((?:[^\(\)\\]|\\[\(\)\\]|\s)*)\)", 
       r"[C](\1)[/C]",  "[B] (foo) (bar) [/B]")

That is the same expression with the whitespace and comments taken out.
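A quick way to check the escaped-parenthesis claim is to run the abbreviated expression on a string that contains backslash-escaped parentheses. The sketch below assumes the backslash escape convention stated in this answer; the names PAREN and wrap are made up for the example.

```python
import re

# Abbreviated expression from the answer above, unchanged.
PAREN = r"\(((?:[^\(\)\\]|\\[\(\)\\]|\s)*)\)"

def wrap(text):
    # Group 1 holds the text between the outer parentheses.
    return re.sub(PAREN, r"[C](\1)[/C]", text)

plain = wrap("[B] (foo) (bar) [/B]")
escaped = wrap(r"[B] (foo \(x\)) [/B]")

print(plain)
print(escaped)
```

Note that, as written, this substitution runs over the whole string, so it would also wrap parenthesized text outside [B] blocks.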
