Strange lxml behavior

Question

Consider the following snippet:

import lxml.html

html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.html.tostring(text.getparent(), encoding='unicode'))
# prints <br>Hello text

I was expecting to see '<div><br />Hello text</div>', because br can't contain nested text and is "self-closed" (I mean the />). How do I make lxml handle this correctly?

Answer 1

HTML doesn't have self-closing tags. That is an XML thing.

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

prints

<br/>Hello text

Note that the text is not inside the tag. lxml has a "tail" concept: text that follows an element's closing tag is stored on that element as its tail.

>>> br = text.getparent()
>>> print(br.text)
None
>>> print(br.tail)
Hello text

Answer 2

When you are dealing with valid XHTML, you can use etree instead of html.

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

Fun thing: you can typically use this to convert HTML to XHTML, by parsing with lxml.html and serializing with lxml.etree:

import lxml.etree
import lxml.html

html = '<div><br>Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

Output: "<br/>Hello text"

Answer 1: score 8, accepted, 2009-10-16 12:55:34

Answer 2: score 2, 2009-10-16 12:59:20
