
How to get unresolved entities from HTML attributes using Python and lxml

When parsing HTML with Python/lxml, I would like to retrieve the actual attribute text for HTML elements, but instead I get the attribute text with resolved entities. That is, if the actual attribute reads this &amp; that, I get back this & that.

Is there a way to get the unresolved attribute value? Here is some example code that shows my problem, using Python 2.7 and lxml 3.2.1:

from lxml import etree
s = '<html><body><a alt="hi &amp; there">a link</a></body></html>'
parser = etree.HTMLParser()
tree = etree.fromstring(s, parser=parser)
anc = tree.xpath('//a')
a = anc[0]
a.get('alt')
# 'hi & there'

a.attrib.get('alt')
# 'hi & there'

etree.tostring(a)
# '<a alt="hi &amp; there">a link</a>'

I want to get the actual string hi &amp; there.

An unescaped special character such as & is invalid in HTML, and an HTML abstraction model (lxml.etree in this case) only works with valid HTML. So there is no notion of an unescaped character once the source HTML has been loaded into the object model.

Given unescaped characters in the HTML source, a parser will either fail completely or try to fix the source automatically. lxml.etree.HTMLParser seems to fall into the latter category. For demo:

from lxml import etree

s = '<div>hi & there</div>'
parser = etree.HTMLParser()
t = etree.fromstring(s, parser=parser)
print(etree.tostring(t.xpath('//div')[0]))
# the source is automatically escaped. Output:
# <div>hi &amp; there</div>

And I believe the HTML tree model doesn't retain information about the original HTML source; it retains the fixed, valid version instead. So at this point, all we can see when serializing is that every character is escaped.
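To illustrate why the original spelling cannot be recovered from the decoded value, here is a minimal sketch using only the standard library. It assumes Python 3, and it uses html.unescape purely as a stand-in for the entity decoding a parser performs; it is not what lxml calls internally.

```python
import html

# Three different source spellings of the same attribute text.
sources = ['hi &amp; there', 'hi &#38; there', 'hi & there']

# html.unescape stands in for the parser's entity decoding (an assumption
# made for illustration only).
decoded = {html.unescape(s) for s in sources}

# All spellings collapse to one decoded string, so the decoded value alone
# cannot tell us which spelling the source used.
print(decoded)
```

Since decoding is many-to-one, any re-escaping step can only produce one canonical form, not necessarily the form that appeared in the source.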

Having said that, how about using cgi.escape() to get the escaped entities back? :p

# ...continuing the demo code above:
import cgi

div = t.xpath('//div')[0]
print(div.text)
# hi & there
print(cgi.escape(div.text))  # escape the text, not the element
# hi &amp; there
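On Python 3, cgi.escape is no longer available (it was removed in Python 3.8); html.escape is the standard-library replacement. Below is a minimal sketch of the same idea applied to an attribute value, assuming a re-escaped canonical form is acceptable. The helper name reescape_attr is hypothetical, and the input string is hard-coded to the value that a.get('alt') returns in the question.

```python
import html


def reescape_attr(value, quote=True):
    """Re-escape an already-decoded attribute value (hypothetical helper).

    With quote=True, double and single quotes are escaped as well, which is
    what an attribute value generally needs.
    """
    return html.escape(value, quote=quote)


parsed_value = 'hi & there'  # what a.get('alt') returns in the question
escaped_value = reescape_attr(parsed_value)
print(escaped_value)
```

This gives back hi &amp; there for the example in the question, but only because the source happened to use the &amp; spelling; a source that wrote &#38; would come out as &amp; too.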
