Weird lxml behavior

Question

Consider the following snippet:

import lxml.html

html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.html.tostring(text.getparent(), encoding='unicode'))
# prints <br>Hello text

I was expecting to see '<div><br />Hello text</div>', because br can't contain nested text and is "self-closed" (I mean the "/>"). How can I make lxml handle this correctly?

Answer 1

HTML doesn't have self-closing tags. That is an XML thing, so the HTML serializer writes <br> and the XML serializer writes <br/>.

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

prints

<br/>Hello text

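Note that the text is not inside the tag. lxml has a "tail" concept: text that follows an element is stored on that element as its tail. That is why getparent() returns the br rather than the div, and why serializing the br also prints the text after it.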

>>> br = text.getparent()
>>> print(br.text)
None
>>> print(br.tail)
Hello text

Answer 2

When you are dealing with valid XHTML, you can use lxml.etree instead of lxml.html.

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))
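
The "valid XHTML" condition matters because lxml.etree.fromstring is a strict XML parser. As a rough sketch, the helper below (parse_strict is a hypothetical name, not part of lxml) shows that the self-closed form parses while plain HTML with an unclosed br does not.

```python
import lxml.etree


def parse_strict(markup):
    """Return the root element, or None if markup is not well-formed XML."""
    try:
        return lxml.etree.fromstring(markup)
    except lxml.etree.XMLSyntaxError:
        return None


# Well-formed: the br is explicitly closed with "/>".
print(parse_strict('<div><br />Hello text</div>') is not None)

# Not well-formed XML: the br is never closed, so parsing fails.
print(parse_strict('<div><br>Hello text</div>') is not None)
```

If the input may be ordinary HTML, lxml.html.fromstring is the more forgiving choice.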

A fun side effect: you can typically use this to convert HTML to XHTML by parsing with lxml.html and serializing with lxml.etree:

import lxml.etree
import lxml.html

html = '<div><br>Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

Output: "<br/>Hello text"

2 answers

solution1
8 ACCEPTED 2009-10-16 12:55:34

solution2
2 2009-10-16 12:59:20
