How do I edit text between 2 string delimiters in Python when the string is many lines
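I am converting an XML document to a .ckl document. They are similar file formats, but it's not as simple as that. I have most of it working, but there is one part I am stuck on.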
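Before parsing the XML with ElementTree, I have to convert some &lt; and &gt; to < and > because the original XML has errors that must be corrected before it will parse correctly. One thing I did not realize is that in some groups I need to leave the &lt; and &gt; alone, because the .ckl reader program displays that text as < and >. Basically, I overcorrected in order to be able to parse, but some of them need to be changed back when they are inside the <fixtext> group.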
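To make the initial correction, I copy the whole XML file into a variable as one big string and run data.replace('&lt;', '<'). This works fine and replaces all the required instances, but it also converts the ones I need to leave as &lt;. After that, I need to change those few cases inside the <fixtext> group back before parsing, or everything gets messed up.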
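TL;DR: In a multiline string where the number of lines between the delimiters varies, I need to replace the < and > between the delimiters <fixtext *tags here*> and </fixtext> with &lt; and &gt;.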
Any help would be appreciated. If you need more information, please let me know and I will be happy to answer any questions.
Example of where the original XML is off:
<description>&lt;VulnDiscussion&gt;
Here, VulnDiscussion should be a new tag.
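To illustrate why the escaped form is a problem for parsing, here is a minimal sketch. The element content ("text") and the closing VulnDiscussion tag are made up for the example, since only the opening part is shown above. While the brackets are escaped, ElementTree sees plain text; once they are unescaped, it sees a child element.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: only the opening part appears in the original XML above
escaped = "<description>&lt;VulnDiscussion&gt;text&lt;/VulnDiscussion&gt;</description>"

root = ET.fromstring(escaped)
print(len(root))   # 0 -> no child elements, the "tag" is just text
print(root.text)   # <VulnDiscussion>text</VulnDiscussion>

# After the blanket replacement the same content parses as a real child tag
unescaped = escaped.replace("&lt;", "<").replace("&gt;", ">")
root2 = ET.fromstring(unescaped)
print(root2[0].tag)   # VulnDiscussion
print(root2[0].text)  # text
```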
Start of a fixtext:
<fixtext fixref="F-22407r554595_fix">Configure the policy value for Computer Configuration &gt;&gt;
Administrative Templates &gt;&gt; Windows Components &gt;&gt; BitLocker Drive Encryption &gt;&gt;
Operating System Drives "Require additional authentication at startup" to "Enabled" with "Configure TPM
Startup PIN:" set to "Require startup PIN with TPM" or with "Configure TPM startup key and PIN:" set to
"Require startup key and PIN with TPM".
</fixtext>
Using a regular expression
import re
import html  # standard library; html.escape() (added in Python 3.2) converts &, < and > to entities

# Example text with &lt; and &gt; both between and outside the <fixtext> tags
html_doc = '''&gt;&gt;&lt;&lt;&lt;&gt;&gt;&lt;&lt;&lt;blahblah<fixtext fixref="F-22407r554595_fix">Configure the policy value for Computer Configuration &gt;&gt;
Administrative Templates &gt;&gt; Windows Components &gt;&gt; BitLocker Drive Encryption &gt;&gt;
Operating System Drives "Require additional authentication at startup" to "Enabled" with "Configure TPM
Startup PIN:" set to "Require startup PIN with TPM" or with "Configure TPM startup key and PIN:" set to
"Require startup key and PIN with TPM".
</fixtext>&gt;&gt;&lt;&lt;&lt;blahblah'''

# Apply to the whole text the blanket substitution that the OP wants to reverse later
html_doc = html_doc.replace('&gt;', '>').replace('&lt;', '<')

# Pattern with named groups for the opening tag, the text between the tags, and the closing tag
p = re.compile(r"(?P<TAG_START><fixtext[^>]*>)(?P<TEXT>.*?)(?P<TAG_END></fixtext>)", flags=re.DOTALL)

# Escape only the characters between the tags (the DOTALL flag lets "." span multiple lines).
# quote=False leaves the double quotes in the text untouched; html.escape() also escapes "&".
corrected = p.sub(
    lambda m: m.group("TAG_START") + html.escape(m.group("TEXT"), quote=False) + m.group("TAG_END"),
    html_doc)
print(corrected)
Note that in corrected, < and > were replaced only between the tags:
>><<<>><<<blahblah<fixtext fixref="F-22407r554595_fix">Configure the policy value for Computer Configuration &gt;&gt;
Administrative Templates &gt;&gt; Windows Components &gt;&gt; BitLocker Drive Encryption &gt;&gt;
Operating System Drives "Require additional authentication at startup" to "Enabled" with "Configure TPM
Startup PIN:" set to "Require startup PIN with TPM" or with "Configure TPM startup key and PIN:" set to
"Require startup key and PIN with TPM".
</fixtext>>><<<blahblah
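If the file can also contain other entities such as &amp; inside a fixtext, a variant that touches only < and > may be safer, because html.escape() would escape the ampersand a second time. The following is a sketch under that assumption, not a definitive implementation: the function name reescape_fixtext and the sample data are hypothetical, and the pattern assumes that no fixtext attribute value contains a > character. The last lines check that ElementTree can parse the result.

```python
import re
import xml.etree.ElementTree as ET

# Opening tag, body (non-greedy, may span lines), closing tag
FIXTEXT_RE = re.compile(r"(<fixtext\b[^>]*>)(.*?)(</fixtext>)", flags=re.DOTALL)

def reescape_fixtext(data: str) -> str:
    """Turn < and > back into &lt; and &gt; inside every <fixtext> element."""
    def fix(match):
        # Only the body is changed; existing entities such as &amp; are left alone
        body = match.group(2).replace("<", "&lt;").replace(">", "&gt;")
        return match.group(1) + body + match.group(3)
    return FIXTEXT_RE.sub(fix, data)

# Made-up sample with two fixtext groups, one of them spanning two lines
sample = (
    '<root><fixtext fixref="F-1">Computer Configuration >> Templates &amp; more</fixtext>'
    '<fixtext fixref="F-2">a < b\nsecond line</fixtext></root>'
)

fixed = reescape_fixtext(sample)
print(fixed)

# The re-escaped string now parses, and the parser decodes the entities again
root = ET.fromstring(fixed)
print([el.text for el in root.iter("fixtext")])
# ['Computer Configuration >> Templates & more', 'a < b\nsecond line']
```

Because the substitution is applied per match, the number of lines between the opening and closing tag does not matter.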