How can I get lxml.Element.xpath to not expand entities?

If I use lxml.etree.XMLParser(resolve_entities=False) to parse XML content, it correctly returns text nodes without the entities expanded. (I'd prefer that it just leave the text of the entity in there; instead it truncates at the first entity.)

from io import BytesIO
from lxml import etree

xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ELEMENT foo ANY>
  <!ENTITY bar "benign">
]>
<body>
    <expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""

nonexpanding_parser = etree.XMLParser(resolve_entities=False)
unexpanded_tree = etree.parse(BytesIO(xml_content), nonexpanding_parser)
elements = unexpanded_tree.xpath('//expansion')
elements[0].text  # '    This is    a '

However, when I call the XPath function normalize-space on it, the entity is expanded, which is what I'm trying to avoid:

elements[0].xpath('normalize-space(.)')  # 'This is a benign entity expansion with weird spacing.'

I suppose I could write my own normalization method, but I'd rather avoid that. I'm not 100% sure what the exact spec of that function is, and since I'm replacing it in my code, I want the replacement to behave the same.

Really the question is: can I get something like elements[0].xpath('normalize-space(.)') that returns 'This is a'?

Even better:

  • This is a entity expansion with weird spacing. (this is the preferred example)
  • This is a &bar; entity expansion with weird spacing.

It's not exactly elegant, but you could extract all the text nodes, join them in Python, and normalize the whitespace.

(Pdb) text_fragments = unexpanded_tree.xpath('//expansion/text()')
(Pdb) text_fragments
['    This is    a ', ' entity expansion with weird spacing.   ']
(Pdb) ' '.join(''.join(text_fragments).split())
'This is a entity expansion with weird spacing.'
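
On the concern about matching the exact spec: XPath 1.0 defines normalize-space as stripping leading and trailing whitespace and collapsing each run of whitespace into a single space, where whitespace means only space, tab, carriage return, and line feed. Python's str.split() with no arguments also splits on other Unicode whitespace (for example a non-breaking space), so the one-liner above can differ on unusual input. A minimal sketch of a closer match follows; normalize_space is a hypothetical helper name, not part of lxml, and it assumes the fragments come from a text() query like the one above.

```python
import re

# XPath 1.0 whitespace is the XML "S" production: space, tab, CR, LF only.
_XPATH_WS = re.compile(r'[ \t\r\n]+')


def normalize_space(fragments):
    """Hypothetical helper: join text fragments, then apply normalize-space rules."""
    joined = ''.join(fragments)
    # Collapse each whitespace run to one space, then trim the ends.
    return _XPATH_WS.sub(' ', joined).strip(' ')


# Fragments as returned by unexpanded_tree.xpath('//expansion/text()') above.
fragments = ['    This is    a ', ' entity expansion with weird spacing.   ']
print(normalize_space(fragments))
```

For the sample document both approaches give the same string; they only diverge when the text contains whitespace characters outside those four.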
