How can I get lxml.Element.xpath to not expand entities?

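If I use lxml.etree.XMLParser(resolve_entities=False) to parse XML content, it correctly returns the text node without expanding entities. (I would prefer that it simply left the entity's text in place; instead, the text is cut off at the first entity.)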

from io import BytesIO
from lxml import etree

xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ELEMENT foo ANY>
  <!ENTITY bar "benign">
]>
<body>
    <expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""

nonexpanding_parser = etree.XMLParser(resolve_entities=False)
unexpanded_tree = etree.parse(BytesIO(xml_content), nonexpanding_parser)
elements = unexpanded_tree.xpath('//expansion')
elements[0].text  # '    This is    a '

However, when I call the XPath function normalize-space on it, the entity gets expanded, which is exactly what I am trying to avoid:

elements[0].xpath('normalize-space(.)')  # 'This is a benign entity expansion with weird spacing.'

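I suppose I could write my own normalization method, but I would rather avoid that, and I am not 100% sure what the exact specification of the function is. I am working on replacing it in my code, so I want the replacement to behave the same way.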
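For reference, XPath 1.0 defines normalize-space as stripping leading and trailing whitespace and replacing each run of whitespace with a single space, where "whitespace" means only the four XML whitespace characters (space, tab, carriage return, line feed). The following is a minimal pure-Python sketch of that rule; the helper name xpath_normalize_space is made up for illustration and is not part of lxml.

```python
import re

# XML whitespace (the S production): space, tab, carriage return, line feed.
_XML_WS = " \t\r\n"
_XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def xpath_normalize_space(value: str) -> str:
    """Hypothetical helper mirroring the XPath 1.0 normalize-space() rule."""
    # Strip XML whitespace from both ends, then collapse inner runs to one space.
    return _XML_WS_RUN.sub(" ", value.strip(_XML_WS))


print(repr(xpath_normalize_space("    This is    a \n\t test.   ")))

# A non-breaking space is not XML whitespace, so it is left untouched here,
# whereas str.split() with no arguments would treat it as a separator.
print(repr(xpath_normalize_space("a\u00a0b")))
print(repr(" ".join("a\u00a0b".split())))
```

The difference only matters for input containing Unicode whitespace outside those four characters; for plain ASCII text the split-and-join idiom gives the same result.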

The real question is: can I get something like elements[0].xpath('normalize-space(.)') to work so that it returns This is a?

Even better would be either of these:

  • This is a entity expansion with weird spacing. (this is the preferred result)
  • This is a &bar; entity expansion with weird spacing.

It is not very elegant, but you can extract all the text nodes, join them in Python, and normalize the whitespace there.

(Pdb) text_fragments = unexpanded_tree.xpath('//expansion/text()')
(Pdb) text_fragments
['    This is    a ', ' entity expansion with weird spacing.   ']
(Pdb) ' '.join(''.join(text_fragments).split())
'This is a entity expansion with weird spacing.'
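
The same idea can be wrapped in a helper that also covers the second desired output, where the &bar; reference is kept as literal text. This is a minimal sketch rather than anything lxml provides: the name string_value and the keep_entity_refs flag are made up for illustration, and it assumes that unresolved entities appear as child nodes whose tag is etree.Entity and whose .text is the reference itself (for example '&bar;').

```python
from io import BytesIO
from lxml import etree


def string_value(node, keep_entity_refs=False):
    """Hypothetical helper: concatenate a node's text without resolving entities."""
    if node.tag is etree.Entity:
        # Unresolved entity reference: emit it literally or drop it.
        return node.text if keep_entity_refs else ""
    if not isinstance(node.tag, str):
        # Comments and processing instructions contribute no text.
        return ""
    parts = [node.text or ""]
    for child in node:
        parts.append(string_value(child, keep_entity_refs))
        # Text that follows a child is stored on that child's tail.
        parts.append(child.tail or "")
    return "".join(parts)


xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ELEMENT foo ANY>
  <!ENTITY bar "benign">
]>
<body>
    <expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""

parser = etree.XMLParser(resolve_entities=False)
tree = etree.parse(BytesIO(xml_content), parser)
expansion = tree.xpath('//expansion')[0]

# Same split-and-join normalization as in the session above.
dropped = ' '.join(string_value(expansion).split())
kept = ' '.join(string_value(expansion, keep_entity_refs=True).split())
print(dropped)
print(kept)
```

Unlike //expansion/text(), which returns only the direct text children, this walk also picks up text inside nested child elements, which is closer to what normalize-space(.) operates on.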

