
Python lxml.etree retain entity references

I'm creating a simple script to parse, validate, fix, and reprint XML files using a specific schema. The whole thing works great, but when I print the modified ElementTree, all of my entity references are erased.

Here's the simplified Python code:

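from pathlib import Path

from lxml import etree as ET
from lxml.builder import E

# XMLSchema takes a parsed tree as its first positional argument,
# so a schema path has to be passed with file=.
schema = ET.XMLSchema(file='C:/path/to/schema.xsd')
parser = ET.XMLParser(recover=True)
source_file = Path('file.xml')
# str(source_file) keeps any directory part; .name would drop it.
tree = ET.parse(str(source_file), parser, base_url="http://www.domain.url")
root = tree.getroot()

# Do some validation

# Serialize the whole tree (DOCTYPE included) and overwrite the source file.
source_file.write_text(ET.tostring(tree, encoding='utf-8').decode(encoding='utf-8'), encoding='utf-8')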

Here is a snippet of the 'before' XML:

<!DOCTYPE element [
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.domain.url/path/to/ISOEntities"> 
%ISOEntities;
]>
<para>&minus;67 to 250&deg;</para>

And after:

<!DOCTYPE element [
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.domain.url/path/to/ISOEntities"> 
<!-- THE ENTIRE CONTENTS OF ISOENTITIES (100s of lines of code) -->
]>
<para>-67 to 250°</para>

While technically 'correct', I want to keep them as entity references instead of literal characters. As noted, it also expands ISOEntities, which I do not want.

Now, the obvious solution I tried was to add the resolve_entities=False kwarg to the parser. The result is that the references are removed entirely and replaced with nothing:

<!DOCTYPE element [
<!ENTITY % ISOEntities PUBLIC "ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML" "http://www.domain.url/path/to/ISOEntities"> 
%ISOEntities;
]>
<para>67 to 250</para>

Is there any way to print the tree to a string as it was when it was parsed (i.e. keeping the internal DTD the same and keeping the entity references intact as well)?

EDIT: I used a debugger to verify that the entities were already missing before the tostring operation, so it's certainly the parsing process that's eliminating them, not the conversion to string.

So I didn't find a good answer to this problem. The entities are all declared in %ISOEntities, but because that is itself an entity and I told the parser not to resolve entities, the parser doesn't resolve %ISOEntities and thus doesn't recognize any of the other entities either.
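To illustrate that explanation, here is a minimal sketch of the opposite situation. It assumes, purely for illustration, that minus and deg are declared directly in the internal subset rather than pulled in through %ISOEntities (the two declarations are made up for this example and are not taken from the real ISO entity set). In that case the parser can see the declarations, and with resolve_entities=False lxml should keep each reference as an entity node instead of dropping it:

```python
from lxml import etree

# Assumption for illustration: the entities are declared inline in the
# internal subset, not loaded from the external %ISOEntities file.
SOURCE = b"""<!DOCTYPE para [
<!ENTITY minus "&#8722;">
<!ENTITY deg "&#176;">
]>
<para>&minus;67 to 250&deg;</para>"""

# resolve_entities=False asks the parser to leave declared entity
# references in the tree instead of replacing them with their text.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(SOURCE, parser)

# Entity references show up as child nodes whose tag is etree.Entity.
entity_names = [node.name for node in root if node.tag is etree.Entity]

# Serializing the root element writes the references back out unexpanded.
serialized = etree.tostring(root, encoding="unicode")
```

This does not solve the original problem, since the declarations there live in an external file, but it suggests the references are lost because they are unknown to the parser, not because of resolve_entities=False as such.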

But I did find a workaround. It turns out &amp; doesn't get replaced, I guess because it's a special case (it is one of XML's predefined entities). So the workaround is to replace every & with &amp; before parsing, so you end up with something like &amp;minus;. The parser doesn't recognize this as an entity and keeps it as is. Once the ElementTree has been converted to a string, you can go through it again and replace every &amp; with &, so you end up with your original entities again.
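A rough sketch of that workaround, in case it helps anyone. The function names protect_entities, restore_entities, and roundtrip are made up for this example, and it assumes the whole file is read as one string and that a blanket replace is acceptable (it also touches any & inside the DOCTYPE or CDATA sections):

```python
def protect_entities(xml_text: str) -> str:
    """Escape every '&' so the parser sees plain text, not entity references."""
    return xml_text.replace("&", "&amp;")


def restore_entities(xml_text: str) -> str:
    """Undo protect_entities() on the serialized output."""
    return xml_text.replace("&amp;", "&")


def roundtrip(xml_text: str) -> str:
    """Parse and re-serialize xml_text while keeping its entity references."""
    # Imported here so the two string helpers above work without lxml installed.
    from lxml import etree

    # Parse from bytes so an encoding declaration in the input is accepted.
    root = etree.fromstring(protect_entities(xml_text).encode("utf-8"))

    # ... validate or modify the tree here ...

    # Serialize the whole tree so that a DOCTYPE, if present, is included.
    serialized = etree.tostring(root.getroottree(), encoding="unicode")
    return restore_entities(serialized)
```

Calling roundtrip('<para>&minus;67 to 250&deg;</para>') should hand back the same markup with &minus; and &deg; intact. An existing &amp; in the input becomes &amp;amp; on the way in and comes back out as &amp;, so the two replacements stay symmetric.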

I'd still love to hear if anyone has a better answer.
