Python lxml.etree retain entity references

Question

I'm creating a simple script to parse, validate, fix, and reprint XML files using a specific schema. The whole thing works great, but the problem is that when I print the modified ElementTree, it erases all of my entity references.

Here's the simplified Python code:

from pathlib import Path

from lxml import etree as ET
from lxml.builder import E

schema = ET.XMLSchema('C:/path/to/schema.xsd')
parser = ET.XMLParser(recover=True)
source_file = Path('file.xml')
tree = ET.parse(source_file.name, parser, base_url="http://www.domain.url")
root = tree.getroot()

# Do some validation

source_file.write_text(ET.tostring(tree, encoding='utf-8').decode(encoding='utf-8'), encoding='utf-8')

Here is a snippet of the 'before' XML:

<!DOCTYPE element [
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.domain.url/path/to/ISOEntities"> 
%ISOEntities
]>
<para>&minus;67 to 250&deg;</para>

And after:

<!DOCTYPE element [
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.domain.url/path/to/ISOEntities"> 
<!-- THE ENTIRE CONTENTS OF ISOENTITIES (100s of lines of code) -->
]>
<para>-67 to 250°</para>

While technically 'correct', I want to keep them as entity references instead of literal characters. As noted, it also expands ISOEntities, which I do not want.

Now, the obvious solution I tried is to add the resolve_entities=False kwarg to the parser. The result is that the references are removed entirely and replaced with nothing:

<!DOCTYPE element [
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.domain.url/path/to/ISOEntities"> 
%ISOEntities
]>
<para>67 to 250</para>

Is there any way to print the tree to a string as it was when it was parsed (i.e., keeping the internal DTD the same and keeping the entity references intact as well)?

EDIT: Used a debugger to verify that the entities were missing prior to the tostring operation, so it's certainly the parsing process that's eliminating them, not the conversion to string.

Answer 1

So I didn't find a good answer to this problem. The entities are all declared in %ISOEntities, but because that is itself a (parameter) entity and I told the parser not to resolve entities, the parser doesn't expand %ISOEntities and thus doesn't recognize any of the other entities either.

But I did find a workaround. It turns out &amp;amp; doesn't get dropped, I guess because it's a special case (it is one of the predefined XML entities). So the workaround is to replace every & with &amp;amp; before parsing, so you end up with something like &amp;amp;minus; in the text. The parser doesn't recognize this as an entity reference and keeps it as is. Once the ElementTree is converted back into a string, you can go through again and replace every &amp;amp; with & so you end up with your original entities again.
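Roughly, the workaround can be sketched like this. It is a minimal sketch, not a drop-in fix: the helper names protect_entities, restore_entities, and round_trip are made up for illustration, the input is assumed to be a plain string without an XML encoding declaration, and it has only been thought through for element content and attribute values, not for documents whose DOCTYPE itself contains ampersands. The fallback import is only there so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as ET


def protect_entities(xml_text):
    """Escape every '&' so the parser sees plain text, not entity references."""
    return xml_text.replace("&", "&amp;")


def restore_entities(xml_text):
    """Undo protect_entities on the serialized output."""
    return xml_text.replace("&amp;", "&")


def round_trip(xml_text):
    """Parse protected text, (optionally) modify the tree, and serialize it back."""
    root = ET.fromstring(protect_entities(xml_text))
    # ... validation and fixes on `root` would go here ...
    serialized = ET.tostring(root, encoding="unicode")
    return restore_entities(serialized)


if __name__ == "__main__":
    print(round_trip("<para>&minus;67 to 250&deg;</para>"))
```

One thing to keep in mind: while the tree is in its protected state, text such as &minus; is just literal characters ("&minus;"), so any validation or fix-up logic that inspects text content sees the reference name rather than the character it stands for.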

I'd still love to hear if anyone has a better answer.

solution1
0 ACCEPTED 2019-11-15 13:25:31
