lxml / Python: reading XML with a CDATA section

Question

In my XML I have a CDATA section. I want to keep the CDATA part, and then strip it. Can someone help with the following?

The default does not work:

>>> from io import StringIO
>>> from lxml import etree
>>> xml = '<Subject> My Subject: 美海軍研究船勘查台海水文？ 船<![CDATA[&#xE9;]]>€ </Subject>'
>>> tree = etree.parse(StringIO(xml))
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船&#xE9;€ '

This post seems to suggest that the parser option strip_cdata=False keeps the CDATA, but it has no effect:

>>> parser = etree.XMLParser(strip_cdata=False)
>>> tree = etree.parse(StringIO(xml), parser=parser)
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船&#xE9;€ '

Using strip_cdata=True, which should be the default, yields the same:

>>> parser = etree.XMLParser(strip_cdata=True)
>>> tree = etree.parse(StringIO(xml), parser=parser)
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船&#xE9;€ '

Answer 1

As you have noticed, CDATA sections are not preserved in the text property of an element, even if strip_cdata=False is used when the XML content is parsed. See https://lxml.de/api.html#cdata.
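To check this side by side, here is a minimal sketch that parses the same document with both parser settings and compares the resulting text. The shortened sample string and the helper name root_text are illustrative assumptions, not part of the question.

```python
from io import StringIO
from lxml import etree

# Shortened sample document (an assumption for illustration).
xml = "<Subject>船<![CDATA[&#xE9;]]>€</Subject>"

def root_text(strip_cdata):
    """Parse the sample with the given strip_cdata setting and return root.text."""
    parser = etree.XMLParser(strip_cdata=strip_cdata)
    return etree.parse(StringIO(xml), parser).getroot().text

text_stripped = root_text(True)   # default behaviour
text_kept = root_text(False)      # CDATA node kept in the tree

# Both are plain strings with the CDATA content merged in as literal text.
print(repr(text_stripped))
print(repr(text_kept))
```

In both cases the text property is an ordinary string, so the CDATA markers never appear in it; the difference only shows up when the tree is serialized.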

CDATA sections are preserved in these cases, provided the document was parsed with strip_cdata=False:

When serializing with tostring():

 print(etree.tostring(tree.getroot(), encoding="UTF-8").decode())

When writing to a file:

 tree.write("subject.xml", encoding="UTF-8")
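The two cases above can be sketched as follows, assuming a shortened sample document. The helper name parse_tree is hypothetical, and an in-memory buffer stands in for subject.xml so that the example does not touch the file system.

```python
from io import BytesIO, StringIO
from lxml import etree

# Shortened sample document (an assumption for illustration).
xml = "<Subject>船<![CDATA[&#xE9;]]>€</Subject>"

def parse_tree(strip_cdata):
    """Parse the sample with the given strip_cdata setting."""
    parser = etree.XMLParser(strip_cdata=strip_cdata)
    return etree.parse(StringIO(xml), parser)

# Serialize with tostring(): the CDATA section survives only with strip_cdata=False.
kept = etree.tostring(parse_tree(False).getroot(), encoding="UTF-8").decode()
stripped = etree.tostring(parse_tree(True).getroot(), encoding="UTF-8").decode()

# Write to a file-like object instead of a real file.
buffer = BytesIO()
parse_tree(False).write(buffer, encoding="UTF-8")
written = buffer.getvalue().decode("UTF-8")

print(kept)
print(stripped)
print(written)
```

With strip_cdata=True the same content is serialized as ordinary escaped text rather than as a CDATA section.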
