How can I make lxml.Element.xpath not expand entities?

Question

If I parse XML content with lxml.etree.XMLParser(resolve_entities=False), the text nodes come back without the entities expanded. (I'd prefer that it just leave the text of the entity in there; instead, .text is truncated at the first entity.)

from io import BytesIO
from lxml import etree

xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ELEMENT foo ANY>
  <!ENTITY bar "benign">
]>
<body>
    <expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""

nonexpanding_parser = etree.XMLParser(resolve_entities=False)
unexpanded_tree = etree.parse(BytesIO(xml_content), nonexpanding_parser)
elements = unexpanded_tree.xpath('//expansion')
elements[0].text  # '    This is    a '

However, when I call the XPath function normalize-space on it, it expands the entity, which is what I'm trying to avoid:

elements[0].xpath('normalize-space(.)')  # 'This is a benign entity expansion with weird spacing.'

I suppose I could write my own normalization method, but I'd rather avoid that. I'm not 100% sure what the exact spec of that function is, and since I'm replacing it in my code, I want the replacement to behave the same.

Really the question is: can I get something like elements[0].xpath('normalize-space(.)') that will return This is a .

Even better:

This is a entity expansion with weird spacing. (this is the preferred example)
This is a &bar; entity expansion with weird spacing.

Answer 1

It's not exactly elegant, but you could extract all the text nodes, join them in Python, and normalize the whitespace.

(Pdb) text_fragments = unexpanded_tree.xpath('//expansion/text()')
(Pdb) text_fragments
['    This is    a ', ' entity expansion with weird spacing.   ']
(Pdb) ' '.join(''.join(text_fragments).split())
'This is a entity expansion with weird spacing.'
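
To make the idea above reusable, here is a minimal sketch, not a definitive implementation. The helper names normalize_space and direct_text are made up for illustration and are not part of lxml. Two assumptions are worth flagging. First, XPath 1.0 defines whitespace for normalize-space as only space, tab, carriage return, and line feed, whereas str.split() also splits on other characters (such as form feed or a non-breaking space), so the sketch uses a regex restricted to those four. Second, it assumes that with resolve_entities=False the unresolved entity shows up as a child node whose tag is etree.Entity and whose .text is '&bar;'. Also note that it only looks at the element's direct text, like text() in the answer, not at text inside descendant elements the way normalize-space(.) does.

```python
import re
from io import BytesIO

from lxml import etree

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse runs of XML whitespace to one space and trim both ends."""
    return XML_WHITESPACE.sub(" ", text).strip(" ")


def direct_text(element, keep_entity_refs=False):
    """Join the element's direct text, optionally keeping '&name;' references.

    Hypothetical helper: it walks the children and collects each child's
    tail; text inside child elements is deliberately ignored.
    """
    parts = [element.text or ""]
    for child in element:
        # Assumption: unresolved entities appear as children tagged etree.Entity.
        if keep_entity_refs and child.tag is etree.Entity:
            parts.append(child.text)  # e.g. '&bar;'
        parts.append(child.tail or "")
    return "".join(parts)


xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ELEMENT foo ANY>
  <!ENTITY bar "benign">
]>
<body>
    <expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""

parser = etree.XMLParser(resolve_entities=False)
tree = etree.parse(BytesIO(xml_content), parser)
element = tree.xpath("//expansion")[0]

print(normalize_space(direct_text(element)))
# This is a entity expansion with weird spacing.
print(normalize_space(direct_text(element, keep_entity_refs=True)))
# This is a &bar; entity expansion with weird spacing.
```

The first call matches the preferred output from the question; the second covers the variant that keeps the literal &bar; in place.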

Solution 1
0 2022-01-19 01:23:30
