Weird lxml behavior

Consider the following snippet:

import lxml.html

html = '<div><br />Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.html.tostring(text.getparent(), encoding='unicode'))
# prints <br>Hello text

I was expecting to see '<div><br />Hello text</div>', because br can't have nested text and is "self-closed" (I mean />). How can I make lxml handle this correctly?

HTML doesn't have self-closing tags; the /> syntax is an XML thing. Parsing the same markup as XML gives:

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

prints

<br/>Hello text

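Note that the text is not inside the br tag. lxml has a "tail" concept: text that follows an element's end tag is stored as that element's tail, so getparent() on that text returns the br element rather than the enclosing div.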

>>> br = text.getparent()
>>> print(br.text)
None
>>> print(br.tail)
Hello text

When you are dealing with valid XHTML, you can use lxml.etree instead of lxml.html.

import lxml.etree

html = '<div><br />Hello text</div>'
doc = lxml.etree.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

A fun side effect: parsing with lxml.html and serializing with lxml.etree typically converts HTML to XHTML-style markup:

import lxml.etree
import lxml.html

html = '<div><br>Hello text</div>'
doc = lxml.html.fromstring(html)
text = doc.xpath('//text()')[0]
print(lxml.etree.tostring(text.getparent(), encoding='unicode'))

Output: "<br/>Hello text"
