lxml moves text along with the element

Question

I have an issue with wrapping an image in a div.

from lxml.html import fromstring
from lxml import etree

tree = fromstring('<img src="/img.png"/> some text')
div = etree.Element('div')
div.insert(0, tree.find('img'))
tree.insert(0, div)
print(etree.tostring(tree, encoding='unicode'))

<span><div><img src="/img.png"/> some text</div></span>

Why does it add a span, and how can I make it wrap the image without the text?

Answer 1

Because lxml is actually an XML parser. It has some forgiving parsing rules that allow it to parse HTML (the lxml.html part), but internally it will always build a valid tree.

'<img src="/img.png"/> some text' isn't a tree: it has no single root element, just an img element and a text node. To be able to store this snippet internally, lxml needs to wrap it in a suitable tag. If you give it a string alone, it will wrap it in a p tag. Earlier versions just wrapped everything in html tags, which can lead to even more confusion.
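A minimal sketch of that wrapping, assuming a Python 3 environment with lxml installed. The variable names are only illustrative, and the exact wrapper tag may depend on the lxml and libxml2 versions in use:

```python
from lxml.html import fromstring

# An element followed by loose text has no single root,
# so lxml.html has to invent a wrapper element for it.
tree = fromstring('<img src="/img.png"/> some text')
print(tree.tag)        # the wrapper element (a span in the question)
print(tree[0].tag)     # the original img element
print(repr(tree[0].tail))  # the loose text, stored as the img's tail

# A bare string also needs a wrapper; the answer above says it is a p tag.
plain = fromstring('some text')
print(plain.tag)
```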

You could also use html.fragment_fromstring, which doesn't add tags in that case, but it would raise an error because the fragment isn't valid.
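To illustrate, here is a short sketch of fragment_fromstring rejecting the snippet. The second call uses the create_parent argument to choose the wrapper tag yourself, which is an option the answer does not mention. The names snippet, error, and wrapped are illustrative:

```python
from lxml import html
from lxml.etree import ParserError

snippet = '<img src="/img.png"/> some text'

# Without a parent, a fragment must be exactly one element
# with no trailing text, so this snippet is refused.
try:
    html.fragment_fromstring(snippet)
    error = None
except ParserError as exc:
    error = exc
print(error)

# Naming the parent explicitly avoids the guessed wrapper.
wrapped = html.fragment_fromstring(snippet, create_parent='div')
print(html.tostring(wrapped, encoding='unicode'))
```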

As for why the text sticks to the img tag: that's how lxml stores text. Take this example:

>>> from lxml import html
>>> p = html.fromstring("<p>spam<br />eggs</p>")
>>> br = p.find("br")
>>> p.text
'spam'
>>> br.text       # empty
>>> br.tail       # this is where text that comes after a tag is stored
'eggs'

So by moving a tag, you also move its tail.
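One way to wrap the image without the text, then, is to hand the tail over to the wrapper before moving the image. This is a sketch under that assumption rather than the only approach. It reuses the names from the question and adds result for the serialized output:

```python
from lxml import etree
from lxml.html import fromstring

tree = fromstring('<img src="/img.png"/> some text')
img = tree.find('img')

# Detach the trailing text from the image so it does not travel with it.
trailing = img.tail
img.tail = None

# Put the wrapper where the image currently sits, then move the image into it.
div = etree.Element('div')
tree.insert(tree.index(img), div)
div.append(img)

# The text now follows the wrapper instead of the image.
div.tail = trailing

result = etree.tostring(tree, encoding='unicode')
print(result)
```

The wrapper lxml.html added around the snippet is still there; only the position of the text has changed.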

Answer 2

lxml.html is a kinder, gentler XML processor that tries to make sense of invalid XML. The string you passed in is just junk from an XML perspective, but lxml.html wrapped it in a span element to make it valid again. If you don't want lxml.html guessing, stick with lxml.etree.fromstring(). That version will reject the string.
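A quick sketch of the strict parser refusing the same input. The rejected flag is just an illustrative name:

```python
from lxml import etree

snippet = '<img src="/img.png"/> some text'

# The strict XML parser does not guess: text after the root element is an error.
try:
    etree.fromstring(snippet)
    rejected = False
except etree.XMLSyntaxError as exc:
    rejected = True
    print('rejected:', exc)
```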
