lxml moves text with element

I have an issue with wrapping an image in a div.

from lxml.html import fromstring
from lxml import etree

tree = fromstring('<img src="/img.png"/> some text')
div = etree.Element('div')
div.insert(0, tree.find('img'))
tree.insert(0, div)
print(etree.tostring(tree, encoding="unicode"))

<span><div><img src="/img.png"/> some text</div></span>

Why does it add a span, and how can I make it wrap the image without the text?

Because lxml is actually an XML parser. It has some forgiving parsing rules that allow it to parse HTML (the lxml.html part), but internally it will always build a valid tree.

'<img src="/img.png"/> some text' isn't a tree, because it has no single root element: there is an img element and a text node. To be able to store this snippet internally, lxml needs to wrap it in a suitable tag. If you give it a string alone, it will wrap it in a p tag. Earlier versions just wrapped everything in html tags, which can lead to even more confusion.
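To see the wrapping described above, one can inspect the tag of the element that fromstring returns. This is only a small sketch; the variable names are made up for illustration, and the exact wrapper chosen may vary between lxml versions.

```python
from lxml import html

# An element followed by loose text has no single root,
# so lxml.html wraps the whole snippet in a container element.
mixed = html.fromstring('<img src="/img.png"/> some text')
print(mixed.tag)
print(html.tostring(mixed, encoding="unicode"))

# A bare string gets wrapped as well (see the answer text above).
text_only = html.fromstring('some text')
print(text_only.tag)
```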

You could also use html.fragment_fromstring, which doesn't add tags in that case, but would raise an error because the fragment isn't valid.
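A rough sketch of that behaviour, assuming a reasonably recent lxml: the error surfaces as etree.ParserError. The create_parent argument used in the second half is not mentioned in the answer; it is shown here only as one way to pick the wrapper tag yourself.

```python
from lxml import etree, html

snippet = '<img src="/img.png"/> some text'

# fragment_fromstring expects exactly one element with no trailing text,
# so this snippet is rejected instead of being silently wrapped.
try:
    html.fragment_fromstring(snippet)
    raised = False
except etree.ParserError as exc:
    raised = True
    print("ParserError:", exc)

# Assumption for illustration: ask for an explicit parent element instead.
wrapped = html.fragment_fromstring(snippet, create_parent='div')
print(wrapped.tag)
```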

As for why the text sticks to the img tag: that's how lxml stores text. Take this example:

>>> from lxml import html
>>> p = html.fromstring("<p>spam<br />eggs</p>")
>>> br = p.find("br")
>>> p.text
'spam'
>>> br.text       # empty
>>> br.tail       # this is where text that comes after a tag is stored
'eggs'

So by moving a tag, you also move its tail.

lxml.html is a kinder, gentler XML processor that tries to make sense of invalid XML. The string you passed in is just junk from an XML perspective, but lxml.html wrapped it in a span element to make it valid again. If you don't want lxml.html guessing, stick with lxml.etree.fromstring(). That version will reject the string.
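A quick way to check the strict behaviour is to catch the exception. In this sketch the rejection is assumed to surface as etree.XMLSyntaxError; the exact message depends on the underlying libxml2 version.

```python
from lxml import etree

snippet = '<img src="/img.png"/> some text'

# The strict XML parser refuses content after the root element.
try:
    etree.fromstring(snippet)
    rejected = False
except etree.XMLSyntaxError as exc:
    rejected = True
    print("rejected:", exc)
```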
