How to get unresolved entities from HTML attributes using Python and lxml
When parsing HTML with Python and lxml, I would like to retrieve the actual attribute text for HTML elements, but instead I get the attribute text with its entities resolved. That is, if the actual attribute reads this &amp; that, I get back this & that.
Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using Python 2.7 and lxml 3.2.1:
from lxml import etree

# The source attribute contains the entity &amp;
s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
parser = etree.HTMLParser()
tree = etree.fromstring(s, parser=parser)
anc = tree.xpath('//a')
a = anc[0]
a.get('alt')
# 'hi & there'
a.attrib.get('alt')
# 'hi & there'
etree.tostring(a)
# '<a alt="hi &amp; there">a link</a>'
I want to get the actual string hi &amp; there.
An unescaped character is invalid in HTML, and the HTML abstraction model (lxml.etree in this case) only works with valid HTML. So there is no notion of an unescaped character once the source HTML has been loaded into the object model.
Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser seems to fall into the latter category. For a demo:
from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# the source is automatically escaped. output:
# <div>hi &amp; there</div>
And I believe the HTML tree model doesn't retain information about the original HTML source; it retains the repaired, valid version instead. So at this point, we can only see that all characters are escaped.
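To illustrate why the original spelling cannot be read back from the tree, here is a minimal standard-library sketch. It assumes that html.unescape behaves like the parser's entity resolution for these simple inputs (it is a stand-in, not what lxml calls internally), and the sample strings are made up for illustration.

```python
import html

# Three hypothetical source spellings of the same attribute value.
sources = [
    "hi &amp; there",   # named entity
    "hi &#38; there",   # decimal character reference
    "hi &#x26; there",  # hexadecimal character reference
]

# html.unescape stands in for the entity resolution a parser performs.
resolved = {html.unescape(s) for s in sources}

# All three spellings collapse to one resolved value, so the resolved
# value alone cannot tell us which spelling the source used.
print(resolved)
```

Since every spelling maps to the same resolved string, any re-escaping step can only produce one canonical form.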
Having said that, how about using cgi.escape() to get the escaped entities! :p
# ...continuing the demo code above:
import cgi

print(t.xpath('//div')[0].text)
# hi & there
print(cgi.escape(t.xpath('//div')[0].text))
# hi &amp; there
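On Python 3, cgi.escape is no longer available (it was removed in 3.8), and html.escape is the standard-library replacement. The following is a minimal sketch of the same re-escaping idea; the helper name reescape is hypothetical, and the input string stands in for what a.get('alt') or element.text would return from the tree.

```python
import html


def reescape(value, quote=True):
    """Re-escape a resolved string so &, < and > become entities again.

    With quote=True, double and single quotes are escaped as well,
    which is what an attribute value needs.
    """
    return html.escape(value, quote=quote)


# Stand-in for the resolved value returned by a.get('alt').
resolved = "hi & there"
print(reescape(resolved))
# hi &amp; there
```

Note that this yields a canonical escaping of the resolved value, which is not necessarily identical to the text in the original source.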