How to get unresolved entities from HTML attributes using Python and lxml

Question

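When parsing HTML with Python and lxml, I would like to retrieve the actual attribute text of HTML elements, but what I get instead is the attribute text with its entities resolved. That is, if the attribute in the source reads "this &amp; that", I get back "this & that".

Is there a way to get the unresolved attribute value? Here is some sample code, using Python 2.7 and lxml 3.2.1, that shows my problem: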

>>> from lxml import etree
>>> s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(s, parser=parser)
>>> anc = tree.xpath('//a')
>>> a = anc[0]
>>> a.get('alt')
'hi & there'

>>> a.attrib.get('alt')
'hi & there'

>>> etree.tostring(a)
'<a alt="hi &amp; there">a link</a>'

I would like to get the actual string "hi &amp; there" as it appears in the source.

Answer 1

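A bare, unescaped character such as & is not valid in HTML, and an HTML abstraction model (lxml.etree in this case) only works with valid HTML. So once the source HTML has been loaded into the object model, there is no longer any notion of escaped versus unescaped characters: the tree holds only the decoded text.

Given an unescaped character in the HTML source, a parser will either fail outright or try to repair the source automatically. lxml.etree.HTMLParser appears to fall into the latter category. As a demonstration: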

from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# The bare & in the source is escaped automatically on serialization.
# Output (Python 2.7):
# <div>hi &amp; there</div>

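I also believe that the HTML tree model does not retain any information about the original HTML source; it keeps only the repaired, valid model. So at this point, all we can observe is that every such character comes out escaped when the tree is serialized.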
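The loss of source information can be checked directly. The sketch below is not from the original answer; parse_alt is a hypothetical helper introduced only for illustration, and the code assumes Python 3 with lxml installed. It parses three different spellings of the same attribute and compares the values the tree reports.

```python
from lxml import etree


def parse_alt(source):
    """Hypothetical helper: parse an HTML snippet and return the alt
    attribute of its first <a> element, as stored in the tree."""
    parser = etree.HTMLParser()
    tree = etree.fromstring(source, parser=parser)
    return tree.xpath('//a')[0].get('alt')


# Three different source spellings of the same attribute value.
named = parse_alt('<a alt="hi &amp; there">a link</a>')    # named entity
numeric = parse_alt('<a alt="hi &#38; there">a link</a>')  # numeric reference
bare = parse_alt('<a alt="hi & there">a link</a>')         # unescaped, repaired by the parser

# If the tree kept the source spelling, these values would differ.
print(named == numeric == bare)
```

If the three values compare equal, nothing in the parsed tree can tell you which spelling the source used.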

That said, how about using cgi.escape() to get the escaped entity back! :p

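# ...continuing the <div> demo code above (reuses t):
import cgi  # Python 2.7; cgi.escape was removed in Python 3.8, use html.escape there

print(t.xpath('//div')[0].text)
# hi & there
print(cgi.escape(t.xpath('//div')[0].text))
# hi &amp; there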
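For the attribute case from the question on Python 3, the same idea might look like the sketch below. It uses html.escape from the standard library instead of cgi.escape; reescape is a hypothetical helper name, and the alt value is written out as a literal standing in for what a.get('alt') returns.

```python
from html import escape


def reescape(value, in_attribute=True):
    """Hypothetical helper: re-escape a decoded value for HTML output.

    With in_attribute=True, quotes are escaped as well, so the result
    is safe to place inside a quoted attribute.
    """
    return escape(value, quote=in_attribute)


# Stand-in for the decoded value that a.get('alt') returns.
alt = 'hi & there'
escaped_alt = reescape(alt)
print(escaped_alt)  # hi &amp; there

# Quotes are handled too when the value is destined for an attribute.
quoted = reescape('say "hi" & go')
print(quoted)  # say &quot;hi&quot; &amp; go
```

Note that this produces a canonical escaped form rather than the original source text: a source spelling such as &#38; would also come back as &amp;.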
