How to get unresolved entities from HTML attributes using Python and lxml

When parsing HTML with python/lxml, I would like to retrieve the actual attribute text for HTML elements, but instead I get the attribute text with the entities resolved. That is, if the actual attribute reads this &amp; that, I get back this & that.

Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using python2.7 and lxml 3.2.1:

>>> from lxml import etree
>>> s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(s, parser=parser)
>>> anc = tree.xpath('//a')
>>> a = anc[0]
>>> a.get('alt')
'hi & there'

>>> a.attrib.get('alt')
'hi & there'

>>> etree.tostring(a)
'<a alt="hi &amp; there">a link</a>'

I want to get the actual string hi &amp; there.

An unescaped character is invalid in HTML, and an HTML abstraction model (lxml.etree in this case) only works with valid HTML. So there is no notion of an unescaped character once the source HTML has been loaded into the object model.

Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser seems to fall into the latter category. For a demo:

from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# the source is automatically escaped. output:
# <div>hi &amp; there</div>

And I believe the HTML tree model doesn't retain information about the original HTML source; it retains the fixed, valid version instead. So at this point, we can only see that all characters are escaped.
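To make that concrete, here is a minimal sketch (written for Python 3 with lxml installed; the helper name alt_of is made up for this example) that parses three different source spellings of the same attribute. As far as I can tell they all end up as the same value in the tree, so the original spelling cannot be recovered from it:

```python
from lxml import etree

def alt_of(source):
    """Parse an HTML string and return the alt attribute of the first <a>.

    Hypothetical helper for this demo only, not part of lxml.
    """
    parser = etree.HTMLParser()
    tree = etree.fromstring(source, parser=parser)
    return tree.xpath('//a')[0].get('alt')

template = '<html><body><a alt="%s">a link</a></body></html>'

escaped = alt_of(template % 'hi &amp; there')   # named entity
numeric = alt_of(template % 'hi &#38; there')   # numeric character reference
bare = alt_of(template % 'hi & there')          # unescaped; the parser repairs it

# All three spellings resolve to the same attribute text.
print(escaped, numeric, bare, sep='\n')
```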

Having said that, how about using cgi.escape() to get the escaped entities back? :p

# ...continuing the demo code above:
import cgi  # Python 2.7, as in the question

div = t.xpath('//div')[0]
print(div.text)
# hi & there
print(cgi.escape(div.text))  # escape the text, not the element
# hi &amp; there
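On Python 3, cgi.escape is gone (it was removed in 3.8), and html.escape from the standard library does the same job. Below is a minimal sketch with a hypothetical helper named reescape. Note that it only produces a canonical escaped form: if the source spelled the ampersand as &#38;, you would still get &amp; back, not the original text.

```python
import html

def reescape(value):
    """Re-escape an already-resolved attribute value (hypothetical helper)."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return html.escape(value, quote=True)

resolved = 'hi & there'      # what a.get('alt') returns after parsing
print(reescape(resolved))    # hi &amp; there

# Different source spellings collapse to the same resolved text,
# so re-escaping cannot tell them apart.
for source in ('hi &amp; there', 'hi &#38; there', 'hi &#x26; there'):
    assert reescape(html.unescape(source)) == 'hi &amp; there'
```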
