
lxml removes unwrapped text inside tag

Here is my Python code using lxml:

from lxml import etree


some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
root = etree.fromstring(some_xml_data)
[c] = root.xpath('//span')
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>' (output as expected)

# But if I make some changes:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)

print(etree.tostring(root))  # b'<span>text1</span>': text2 and text3 were removed! How do I prevent this?

It looks like after I make some changes to the lxml tree (deleting some tags), lxml also removes some of the unwrapped text! How can I prevent lxml from doing this and keep the unwrapped text?

The text that follows a node is called its tail, and removing the node removes its tail with it. The tail can be preserved by appending it to the parent's text before the node is removed. Here is a sample:

In [1]: from lxml import html

In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
   ...: 

In [3]: tree = html.fromstring(s)

In [4]: for node in tree.iterchildren("div"):
   ...:     parent = node.getparent()
   ...:     if node.tail:
   ...:         # parent.text may be None, so fall back to an empty string
   ...:         parent.text = (parent.text or "") + node.tail
   ...:     parent.remove(node)
   ...:     

In [5]: html.tostring(tree)
Out[5]: b'<span>text1text2text3</span>'

I use lxml.html because the markup looks more like HTML than XML. You can also pass "div" to iterchildren directly, which avoids the extra check on the tag.
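Appending the tail to the parent's text works in the sample because every child of the span is removed. If some siblings stay in the tree, the tail has to go onto the preceding sibling's tail instead, or the text ends up in the wrong place. A minimal sketch of that idea follows; remove_keep_tail is a hypothetical helper name, not part of lxml, and the extra <b> element in the input is added only for illustration:

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove `node` from its parent but keep the text that follows it.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if node.tail:
        previous = node.getprevious()
        if previous is not None:
            # A sibling precedes the node: extend that sibling's tail
            # so the text stays in document order.
            previous.tail = (previous.tail or "") + node.tail
        else:
            # No preceding sibling: the text directly follows the parent's text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


root = etree.fromstring(
    "<span>text1<b>keep</b>text2<div>ddd</div>text3<div>ddd</div>text4</span>"
)

# Build the list first so the tree is not modified while it is being iterated.
for div in list(root.iterchildren("div")):
    remove_keep_tail(div)

result = etree.tostring(root)
print(result)  # b'<span>text1<b>keep</b>text2text3text4</span>'
```

When the document is parsed with lxml.html, the elements also provide a drop_tree() method, which is documented to remove the element together with its children while joining the tail text to the previous element or the parent, so it may be enough on its own for HTML input.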


 