Consider the following snippet:
import lxml.html
html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print lxml.html.tostring(text.getparent())
#prints <br>Hello text
I was expecting to see '<div><br />Hello text</div>'
, because br
can't have nested text and is "self-closed" (I mean />
). How to make lxml
handle it right?
HTML doesn't have self-closing tags. It is a xml thing.
import lxml.etree
html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print lxml.etree.tostring(text.getparent())
prints
<br/>Hello text
Note that the text is not inside the tag. lxml
has a " tail
" concept.
>>> print text.text
None
>>> print text.tail
Hello text
When you are dealing with valid XHTML you can use the etree instead of html.
import lxml.etree
html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print lxml.etree.tostring(text.getparent())
Fun thing, you can typically use this to convert HTML to XHTML:
import lxml.etree
import lxml.html
html = '<div><br>Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print lxml.etree.tostring(text.getparent())
Output: "<br/>Hello text"
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.