
How to keep &amp; when parsing an XML file using lxml and xpath

I am trying to extract some information from an input XML file and print it to an output file using lxml and xpath expressions. I run into a problem when reading an XML tag like the following:

...
<editor> Barnes &amp; Nobel </editor>
...

To parse the XML file and print the editor content, I use the following (there is always exactly one editor in the XML):

import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
docTree = etree.parse(io.BytesIO(open(inputXML, "rb").read()), parser)
print(docTree.xpath('//editor')[0].text)

My problem is that the &amp; gets converted at some point into '&', which messes up my further processing.

How can I ensure that the &amp; entity will not be "decoded"?

I know this will sound presumptuous, but you want the data to be "&". That is the text content of the XML element. If you have later processing that needs it as "&amp;", then you need a step that will XML- (or HTML-) encode it back to "&amp;".

You cannot ask an XML parser to parse your document and not turn "&amp;" into "&". It won't do it.
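The re-encoding step described above could look roughly like the following. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages (lxml's .text decodes entities in the same way), and the sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample input modelled on the <editor> element in the question
xml_source = "<book><editor> Barnes &amp; Nobel </editor></book>"

root = ET.fromstring(xml_source)

# The parser hands back the decoded text content: "&amp;" has become "&"
editor_text = root.find(".//editor").text

# Re-encode the text for any later step that expects XML-escaped output;
# escape() converts &, < and > into &amp;, &lt; and &gt;
encoded = escape(editor_text)

print(repr(editor_text))
print(repr(encoded))
```

With this approach the parsed tree always holds the real data, and escaping happens once, at the point where the text is written out.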

I finally found the answer to my own question in an answer to "How do I escape ampersands in XML so they are rendered as entities in HTML?". In my code I have added an intermediate step to ensure that every &amp; stays the same in the output. This is:

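import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
# Double-escape &amp; so the parser decodes it back to the literal text "&amp;"
xmlText = open(inputXML, "rb").read().replace(b"&amp;", b"&amp;amp;")
docTree = etree.parse(io.BytesIO(xmlText), parser)
print(docTree.xpath('//editor')[0].text)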

In fact, just in case, I have applied the same recipe to the other predefined XML entities listed at http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined%5Fentities%5Fin%5FXML
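One way to extend the same recipe to all five predefined XML entities is sketched below. The helper name protect_entities and the sample document are hypothetical, and the standard library's ElementTree stands in for lxml so the snippet has no external dependency. A single regular-expression pass is used so that an already rewritten &amp;amp; is not rewritten a second time.

```python
import re
import xml.etree.ElementTree as ET

# The five entities predefined by XML: &amp; &lt; &gt; &quot; &apos;
_PREDEFINED = re.compile(r"&(amp|lt|gt|quot|apos);")

def protect_entities(raw_xml):
    """Double-escape predefined entities so they survive parsing as literal text.

    Hypothetical helper: "&lt;" becomes "&amp;lt;", which the parser then
    decodes back to the five characters "&lt;".
    """
    return _PREDEFINED.sub(r"&amp;\1;", raw_xml)

# Hypothetical sample input
raw = "<book><editor>Barnes &amp; Nobel &lt;books&gt;</editor></book>"

root = ET.fromstring(protect_entities(raw))
editor_text = root.find(".//editor").text
print(editor_text)
```

Note that this rewrites every occurrence in the raw document, including entities inside attribute values, and it leaves numeric character references such as &#38; untouched.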
