
Python regex replace substrings inside strings

I have a string like this:

import re

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

And the expected output is:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

Here's a small subset of what I've tried:

# None of these work; they typically replace only the first or last occurrence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)

What I'd like to happen: all occurrences of < between CODE and CODE are replaced with _LT_.

After spending the day on Stack Overflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.

Any help is tremendously appreciated!

Thanks in advance.

Here is my answer:

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

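# Splitting on the marker puts the text between CODE markers at odd indices.
parts = text.split('CODE')
for i in range(1, len(parts), 2):
    parts[i] = parts[i].replace('>', '_GT_').replace('<', '_LT_')

# Re-join with the marker so the CODE lines are kept in the output.
output = 'CODE'.join(parts)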


print(output)

With this solution, every code block is formatted and added to output. It does not use regex, but it works.

With regex:

import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))

See Python proof.

Results:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  CODE                     'CODE'
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
--------------------------------------------------------------------------------
  CODE                     'CODE'
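
To see why the lazy quantifier matters, here is a small sketch with two CODE blocks. The sample string and the helper name escape are made up for illustration. With .*? each block is matched on its own; with a greedy .* a single match runs from the first CODE to the last one, so the text between the blocks gets escaped as well.

```python
import re

# Hypothetical sample: two CODE blocks with a tag between them.
sample = "CODE<a>CODE keep <b> CODE<c>CODE"

def escape(match):
    # Escape angle brackets only inside the matched span.
    return match.group().replace('<', '_LT_').replace('>', '_GT_')

lazy = re.sub(r"(?s)CODE.*?CODE", escape, sample)
greedy = re.sub(r"(?s)CODE.*CODE", escape, sample)

print(lazy)    # CODE_LT_a_GT_CODE keep <b> CODE_LT_c_GT_CODE
print(greedy)  # CODE_LT_a_GT_CODE keep _LT_b_GT_ CODE_LT_c_GT_CODE
```

Passing a function as the replacement is what lets one match cover a whole block while still replacing every < and > inside it.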

I'll update this answer in a few minutes with a regex-only solution but, meanwhile: wouldn't splitting the string and then joining the parts be a solution?

re.sub(regex, value, text.split("CODE\n")[1], flags=flags)
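
A minimal sketch of that split-and-join idea, assuming the text contains exactly one CODE block and each marker is followed by a newline. The sample string and variable names are illustrative, not from the question.

```python
import re

text = "keep <b>here</b>\nCODE\n<i>escape me</i>\nCODE\nkeep <b>this</b>\n"

# Assumes exactly two "CODE\n" markers, so the split yields three parts.
before, code, after = text.split("CODE\n")

# Only the middle part is rewritten.
code = re.sub(r"<", "_LT_", code)
code = re.sub(r">", "_GT_", code)

# Put the markers back when joining.
result = "CODE\n".join([before, code, after])
print(result)
```

If the text may hold several blocks, the three-way unpacking raises a ValueError, so this form only suits the single-block case.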

EDIT: I found the answer, but it's a little bit hacky. You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327

Basically, the line you are looking for is this:

text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)

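This works for text placed between two "CODE" markers that each sit on their own line, as long as there is no "CODE" string between them. Note that [^(CODE)] is a character class, not a negated word: it excludes the individual characters (, ), C, O, D and E, so a block containing any of those characters will not match either.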

will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""

wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""

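A quick check of these strings against the pattern above, as a sketch. The wrapper name escape_code and the third string also_fails are hypothetical additions for illustration.

```python
import re

def escape_code(text):
    # Same pattern as above; [^(CODE)] excludes the characters ( ) C O D E.
    return re.sub('\nCODE\n[^(CODE)]*\nCODE\n',
                  lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'),
                  text)

will_work = "\n<title>This will work</title>\nCODE\n<b>Replace this</b>\nCODE\n"
wont_work = "\nCODE\n<b>This won't work</b>CODE\nCODE\n"
# Hypothetical extra case: the capital D alone is enough to block the match.
also_fails = "\nCODE\n<b>Done</b>\nCODE\n"

print(escape_code(will_work))                 # block escaped, <title> untouched
print(escape_code(wont_work) == wont_work)    # True: no match, text unchanged
print(escape_code(also_fails) == also_fails)  # True: no match, text unchanged
```
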