How to keep &amp; when parsing an XML file using lxml and XPath

Question

I am trying to extract some information from an input XML file and write it to an output file using lxml and XPath expressions. I run into a problem when reading an XML element like the following:

...
<editor> Barnes &amp; Nobel </editor>
...

In order to parse the xml file and print the editor content I use (there is always only one editor in the xml):

import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
docTree = etree.parse(io.BytesIO(open(inputXML, "rb").read()), parser)
print(docTree.xpath('//editor')[0].text)

My problem is that the &amp; gets converted at some point into '&', which messes up my further processing.

How can I ensure that &amp; will not be "decoded"?

Answer 1

I know this will sound presumptuous, but you want the data to be "&". That is the text content of the XML element. If you have later processing that needs it as "&amp;", then you need a step that will XML- (or HTML-) encode it back to "&amp;".

You cannot ask an XML parser to parse your document and not turn "&amp;" into "&". It won't do it.
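
As a minimal sketch of the re-encoding step described above: let the parser decode the text, then escape it again before the later processing. The use of xml.sax.saxutils.escape and the variable names are my own choices for illustration, not something from the question. The fallback import is only there so the snippet also runs where lxml is not installed; the standard library parser decodes &amp; in the same way.

```python
from xml.sax.saxutils import escape

try:
    from lxml import etree
except ImportError:
    # Assumption for portability: ElementTree decodes entities the same way.
    import xml.etree.ElementTree as etree

# Same element as in the question, wrapped in a hypothetical <doc> root.
root = etree.fromstring(b"<doc><editor> Barnes &amp; Nobel </editor></doc>")

# The parser always hands back the decoded text content.
editor_text = root.find(".//editor").text

# Re-encode "&", "<" and ">" for any later step that expects escaped text.
escaped_text = escape(editor_text)

print(repr(editor_text))
print(repr(escaped_text))
```

Serializing the element again (for example with etree.tostring) also produces escaped output, so an explicit escape call is only needed when the text is written out by other means.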

Answer 2

I finally found the answer to my own question in the answer to "How do I escape ampersands in XML so they are rendered as entities in HTML?" I have added an intermediate step to my code that double-escapes every &amp; in the raw input, so that the parser decodes it back to "&amp;" and it reaches the output unchanged:

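import io
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
# Double-escape &amp; so the parser decodes it back to a literal "&amp;"
xmlText = open(inputXML, "rb").read().replace(b"&amp;", b"&amp;amp;")
docTree = etree.parse(io.BytesIO(xmlText), parser)
print(docTree.xpath('//editor')[0].text)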

In fact, just in case, I have applied the same recipe to the other predefined XML entities listed at http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined%5Fentities%5Fin%5FXML
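
Extending the recipe to all five predefined entities could look roughly like the sketch below. The helper name protect_entities and the regular expression are hypothetical, not the code I actually ran. A single regex pass is used instead of chained replace() calls because chaining depends on order: rewriting &lt; to &amp;lt; first and &amp; afterwards would escape the result twice.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption for portability: ElementTree decodes entities the same way.
    import xml.etree.ElementTree as etree

# The five entities predefined by XML: &amp; &lt; &gt; &quot; &apos;
PREDEFINED = re.compile(rb"&(amp|lt|gt|quot|apos);")


def protect_entities(raw_xml):
    """Double-escape predefined entities in raw XML bytes (hypothetical helper)."""
    # "&lt;" becomes "&amp;lt;", which the parser decodes back to "&lt;".
    return PREDEFINED.sub(rb"&amp;\1;", raw_xml)


raw = b"<doc><editor> Barnes &amp; Nobel &lt;books&gt; </editor></doc>"
protected = protect_entities(raw)

root = etree.fromstring(protected)
editor_text = root.find(".//editor").text

print(protected)
print(repr(editor_text))
```

Note that this rewrites entities everywhere in the file, including attribute values, and leaves numeric character references such as &#38; untouched, so it only suits input where that is acceptable.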
