lxml removes unwrapped text inside a tag

Question

Here is my Python code using lxml:

from lxml import etree


some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
root = etree.fromstring(some_xml_data)
[c] = root.xpath('//span')
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>' (output as expected)

# But if I make some changes:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)

print(etree.tostring(root))  # b'<span>text1</span>' (text2 and text3 are removed! How can I prevent this deletion?)

It looks like after I make some changes to the lxml tree (deleting some tags), lxml also removes some of the unwrapped text! How can I prevent lxml from doing this and keep the unwrapped text?

Answer 1

The text that follows a node is called its tail, and it is removed together with the node. It can be preserved by appending it to the parent's text before removing the node. Here is a sample:

In [1]: from lxml import html

In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
   ...: 

In [3]: tree = html.fromstring(s)

In [4]: for node in tree.iterchildren("div"):
   ...:     if node.tail:
   ...:         node.getparent().text += node.tail
   ...:     node.getparent().remove(node)
   ...:     

In [5]: html.tostring(tree)
Out[5]: b'<span>text1text2text3</span>'

I use lxml.html because the structure looks more like HTML than XML. You can also pass the tag name to iterchildren("div") to avoid the additional check on the tag.

Solution 1
3 | Accepted | 2016-07-29 14:35:26
