lxml/python reading xml with CDATA section
In my XML I have a CDATA section. I want to keep the CDATA part, and then strip it. Can someone help with the following?
Default does not work:
$ from io import StringIO
$ from lxml import etree
$ xml = '<Subject> My Subject: 美海軍研究船勘查台海水文? 船<![CDATA[é]]>€ </Subject>'
$ tree = etree.parse(StringIO(xml))
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '
This post seems to suggest that the parser option strip_cdata=False may keep the CDATA, but it has no effect:
$ parser=etree.XMLParser(strip_cdata=False)
$ tree = etree.parse(StringIO(xml), parser=parser)
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '
Using strip_cdata=True, which should be the default, yields the same:
$ parser=etree.XMLParser(strip_cdata=True)
$ tree = etree.parse(StringIO(xml), parser=parser)
$ tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '
As you have noticed, CDATA sections are not preserved in the text property of an element, even if strip_cdata=False is used when the XML content is parsed. The text property always returns the plain character data, with or without the CDATA markup in the source. See https://lxml.de/api.html#cdata.
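To make this concrete, here is a minimal sketch, assuming lxml is installed. The sample XML is a shortened, hypothetical stand-in for the one in the question, and root_text is an illustrative helper name, not part of lxml. It parses the same input with both parser settings and compares the text property:

```python
from io import StringIO
from lxml import etree

# Hypothetical sample; the CDATA section contains a character ("<")
# that would otherwise need escaping.
xml = "<Subject>My Subject: <![CDATA[a < b]]> end</Subject>"

def root_text(strip_cdata):
    """Parse the sample with the given strip_cdata setting and return root.text."""
    parser = etree.XMLParser(strip_cdata=strip_cdata)
    return etree.parse(StringIO(xml), parser=parser).getroot().text

text_kept = root_text(False)
text_stripped = root_text(True)

print(repr(text_kept))
print(text_kept == text_stripped)
```

Both calls should return the same string, with the CDATA content merged into the surrounding text and no CDATA markup.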
When the document has been parsed with strip_cdata=False, CDATA sections are preserved in these cases:
When serializing with tostring():
print(etree.tostring(tree.getroot(), encoding="UTF-8").decode())
When writing to a file:
tree.write("subject.xml", encoding="UTF-8")
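The two cases above can be checked end to end with a short sketch. It assumes lxml is installed, uses a shortened, hypothetical sample instead of the XML from the question, and writes to a temporary directory rather than the current one:

```python
import os
import tempfile
from io import StringIO
from lxml import etree

# Hypothetical sample with a CDATA section in the middle of the text.
xml = "<Subject>My Subject: <![CDATA[a < b]]> end</Subject>"

# strip_cdata=False keeps CDATA nodes in the parsed tree.
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse(StringIO(xml), parser=parser)

# Case 1: serialize to a string.
serialized = etree.tostring(tree.getroot(), encoding="UTF-8").decode()
print(serialized)

# Case 2: write to a file, then read the file back as text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subject.xml")
    tree.write(path, encoding="UTF-8")
    with open(path, encoding="UTF-8") as f:
        written = f.read()
print(written)
```

In both outputs the CDATA markup should appear unchanged. With the default parser (strip_cdata=True) the same code would instead emit the content as escaped text.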