
How to keep &amp; when parsing an XML file using lxml and XPath

I am trying to extract some information from an input XML file and write it to an output file using lxml and XPath expressions. I run into a problem when reading an XML tag like the following:

...
<editor> Barnes &amp; Nobel </editor>
...

To parse the XML file and print the editor content, I use the following (there is always exactly one editor element in the XML):

import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
with open(inputXML, "rb") as f:  # read bytes, since io.BytesIO needs bytes
    docTree = etree.parse(io.BytesIO(f.read()), parser)
print(docTree.xpath('//editor')[0].text)

My problem is that the &amp; gets converted into '&' at some point, which breaks my further processing.

How can I ensure that the &amp; symbol will not be "decoded"?

I know this will sound presumptuous, but you want the data to be "&". That is the text content of the XML element. If later processing needs it as "&amp;", then you need a step that XML- (or HTML-) encodes it back to "&amp;".

You cannot ask an XML parser to parse your document and not turn "&amp;" into "&". It won't do it.
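As a rough illustration of that re-encoding step, the sketch below parses the element, lets the parser decode the entity, and then escapes the text again on the way out. It is only a sketch under assumptions: the standard library's xml.etree.ElementTree stands in for lxml so the example has no third-party dependency (element text is decoded the same way), and SAMPLE and editor_text_escaped are made-up names, not part of the original code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample document, modeled on the snippet in the question.
SAMPLE = "<root><editor> Barnes &amp; Nobel </editor></root>"

def editor_text_escaped(xml_source):
    """Return the first <editor> text, re-escaped for XML/HTML output."""
    root = ET.fromstring(xml_source)
    raw = root.find(".//editor").text  # the parser has already decoded &amp; to &
    return escape(raw)                 # turns & into &amp;, < into &lt;, > into &gt;

print(editor_text_escaped(SAMPLE))
```

With this approach the parsed tree always holds the real character data, and escaping happens only at the point where the text is written out.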

I finally found the answer to my own question in the answers to "How do I escape ampersands in XML so they are rendered as entities in HTML?". In my code I have added an intermediate step that double-escapes every &amp; in the input, so that it still reads &amp; after parsing:

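import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
with open(inputXML, "rb") as f:
    # double-escape, so the parser decodes &amp;amp; back to the literal text &amp;
    xmlText = f.read().replace(b"&amp;", b"&amp;amp;")
docTree = etree.parse(io.BytesIO(xmlText), parser)
print(docTree.xpath('//editor')[0].text)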

In fact, just in case, I have applied the same recipe to the other predefined XML entities listed at http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined%5Fentities%5Fin%5FXML
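A minimal sketch of extending the double-escaping recipe to all five predefined XML entities (amp, lt, gt, quot, apos) might look like this. The names PREDEFINED, double_escape and SAMPLE are hypothetical, and the standard library's ElementTree is used in place of lxml purely to keep the example dependency-free.

```python
import re
import xml.etree.ElementTree as ET

# Matches the five entity references predefined by XML.
PREDEFINED = re.compile(r"&(amp|lt|gt|quot|apos);")

def double_escape(xml_source):
    """Escape the leading & of each predefined entity reference a second time."""
    return PREDEFINED.sub(r"&amp;\1;", xml_source)

# Hypothetical sample input with several entity references.
SAMPLE = "<root><editor> Barnes &amp; Nobel &lt;Books&gt; </editor></root>"

# After parsing, the element text still contains the entity references verbatim.
text = ET.fromstring(double_escape(SAMPLE)).find(".//editor").text
print(text)
```

Note that this changes what the document means to the parser: the element text becomes the literal characters "&amp;" rather than "&", so any other consumer of the parsed tree sees the escaped form as well.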
