Python 正则表达式替换字符串中的子字符串

Question

I have a string like this:我有一个这样的字符串：

import re

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

And the expected output is:而预期的 output 是：

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

Here's a small subset of what I've tried:这是我尝试过的一小部分：

# None of these work, and typically only replace the first or last occurence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)

What I'd like to happen: All occurrences of < between CODE and CODE to be replaced with _LT_ .我想要发生的事情： CODE和CODE之间出现的所有<都将替换为_LT_ 。

After spending the day on stackoverflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.在 stackoverflow 和 regex101.com 上度过了一天之后，我开始认为这是不可能的，或者我不够聪明来处理这个问题。

Any help is tremendously appreciated!非常感谢任何帮助！

Thanks in advance.提前致谢。

Answer 1

Here is my answer:这是我的答案：

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

output = ''
for i in range(len(text.split('CODE'))):
    if i % 2:
        output += text.split('CODE')[i].replace('>', '_GT_').replace('<', '_LT_')
    else:
        output += text.split('CODE')[i]


print(output)

With this solution, every code block is being formated and added to the output .使用此解决方案，每个代码块都被格式化并添加到output中。 This does not include regex but this works.这不包括regex ，但这有效。

Answer 2

With regex:使用正则表达式：

import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))

See Python proof .参见Python 证明。

Results :结果：

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

See regex proof .请参阅正则表达式证明。

EXPLANATION解释

--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  CODE                     'CODE'
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
--------------------------------------------------------------------------------
  CODE                     'CODE'

Answer 3

I'll update this answer in a few minutes with an only-regex solution but, meanwhile... Is not doing a split and then join strings a solution?我将在几分钟内使用唯一的正则表达式解决方案更新此答案，但同时......不是进行拆分然后加入字符串解决方案吗？

re.sub(regex, value, text.split("CODE\n")[1], flags)

EDIT, I found the answer: but it's a little bit hacky You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327编辑，我找到了答案：但它有点hacky你可以阅读这篇文章的完整描述： https://stackoverflow.com/a/11096811/8665327

Basically, the line you are looking for is this:基本上，您正在寻找的行是这样的：

text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)

This will work with the first set of text placed between "CODE" text in its own line as long as there is no "CODE" string between them这将适用于放置在“CODE”文本之间的第一组文本，只要它们之间没有“CODE”字符串

will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""

wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""

Python 正则表达式替换字符串中的子字符串

问题描述

3 个解决方案

解决方案1
3 已采纳 2021-06-06 17:05:06

解决方案2
2 2021-06-06 19:59:34

解决方案3
0 2021-06-06 16:44:37

Python 正则表达式替换字符串中的子字符串

问题描述

3 个解决方案

解决方案1 3 已采纳 2021-06-06 17:05:06

解决方案2 2 2021-06-06 19:59:34

解决方案3 0 2021-06-06 16:44:37

解决方案1
3 已采纳 2021-06-06 17:05:06

解决方案2
2 2021-06-06 19:59:34

解决方案3
0 2021-06-06 16:44:37