How to get unresolved entities from HTML attributes using Python and lxml

Question

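When parsing HTML with Python and lxml, I want to retrieve the actual attribute text of an HTML element, but instead I get the attribute text with its entities resolved. That is, if the actual attribute reads this &amp; that, I get back this & that.

Is there a way to get the unresolved attribute value? Here is some sample code, using Python 2.7 and lxml 3.2.1, that shows my problem: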

from lxml import etree
s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
parser = etree.HTMLParser()
tree = etree.fromstring(s, parser=parser)
anc = tree.xpath('//a')
a = anc[0]
a.get('alt')
'hi & there'

a.attrib.get('alt')
'hi & there'

etree.tostring(a)
'<a alt="hi &amp; there">a link</a>'

I want to get the actual string hi &amp; there.

Answer 1

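Unescaped characters are invalid in HTML, and the HTML abstraction model (lxml.etree in this case) only works with valid HTML. So once the source HTML has been loaded into the object model, there is no longer any notion of escaped characters.

Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser appears to fall into the latter category. To demonstrate: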

from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# the bare "&" in the source is automatically escaped on output:
# <div>hi &amp; there</div>

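And I believe the HTML tree model doesn't retain information about the original HTML source; it retains the fixed, valid version instead. So at this point, all we can see is that every character has been escaped.

Having said that, how about using cgi.escape() to get the escaped entities! :p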
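To illustrate why the original spelling cannot be recovered from the tree, here is a small sketch. It uses the standard library's html.unescape() as a stand-in for the parser's entity resolution, which is an assumption made for illustration and not how lxml is implemented internally.

```python
import html

# Three different source spellings of the same attribute text.
sources = ["hi &amp; there", "hi &#38; there", "hi &#x26; there"]

# html.unescape() stands in for the parser's entity resolution
# (an assumption for illustration only).
decoded = [html.unescape(s) for s in sources]
print(decoded)

# All three collapse to a single value, so nothing in the decoded
# string tells us which spelling the source actually used.
print(len(set(decoded)))
```

Once the values are identical, any "unresolved" form has to be regenerated rather than read back.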

# ...continuing the demo code above:
import cgi  # Python 2; cgi.escape() was removed in Python 3.8 (use html.escape())

print(t.xpath('//div')[0].text)
# hi & there
print(cgi.escape(t.xpath('//div')[0].text))
# hi &amp; there
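For readers on Python 3, where cgi.escape() is no longer available, the same idea can be sketched with html.escape(). The helper name reescape_attr is hypothetical, and the literal string stands in for whatever a.get('alt') returns in the question's code. Note that this yields a normalized escaped form, which is not guaranteed to match the original source byte for byte.

```python
import html

def reescape_attr(value):
    """Re-escape a parsed attribute value for use inside a quoted HTML attribute."""
    # quote=True (the default) also escapes " and ', which matters in attributes.
    return html.escape(value, quote=True)

# Stand-in for the value lxml hands back, e.g. a.get('alt') in the question.
parsed = "hi & there"
print(reescape_attr(parsed))
```

In the question's setup, calling reescape_attr(a.get('alt')) would be the equivalent step.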

Answer 1 (2 votes, accepted, 2015-05-05 01:15:02)