
How to remove text not in a tag using lxml?

Now I have XML like the following:

<div>
<p>the first paragraph</p>
<p>the second paragraph</p>
something others...
</div>

And I want to remove the something others... text from the object content.

I know the text can be selected with content.xpath('.//text()[not(ancestor::p)]'), but there seems to be no good way to remove it directly from the object.


Update: I tried //p[last()]/following::*, but it does not work the way I want...

That text is stored in the tail attribute of the preceding sibling element, so to remove all such "something others..." text, do:

# Clears the tail of every element, including whitespace between tags
for elem in document.iter():
    elem.tail = ''
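
To see where the stray text actually lives, here is a minimal sketch run against the XML from the question. The variable names (XML, root, paragraphs, result) are made up for illustration, and the import falls back to the standard library's ElementTree, which shares the .text/.tail/.iter() model, only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in here,
    # since it exposes the same .tail and .iter() API.
    import xml.etree.ElementTree as etree

XML = """<div>
<p>the first paragraph</p>
<p>the second paragraph</p>
something others...
</div>"""

root = etree.fromstring(XML)
paragraphs = root.findall("p")

# Text that follows a closing tag is stored on that element as .tail
first_tail = paragraphs[0].tail    # only the newline between the two <p> tags
second_tail = paragraphs[1].tail   # the newline-wrapped "something others..."

# Clear every tail in the tree; None means "no tail text"
for elem in root.iter():
    elem.tail = None

result = etree.tostring(root, encoding="unicode")
print(result)
```

Note that the loop also drops the whitespace-only tails between tags, so the two p elements end up directly adjacent in the serialized output.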

Edit:

To remove the tail text of only the last p sibling under each parent in the document:

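for elem in document.iter():
    # getnext() returns None when there is no following sibling; test with
    # "is None", because an element that has no children is itself falsy
    if elem.tag == 'p' and elem.getnext() is None:
        elem.tail = ''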
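
The "last p sibling" rule can be checked end to end on a small input. This is a sketch under assumptions: the sample markup and the helper name strip_last_p_tails are hypothetical, and instead of getnext() it walks each parent and inspects its last child, which should be equivalent to asking whether a p has no following sibling. As a side effect it also runs on the standard library's ElementTree, used here as a fallback when lxml is missing.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback so the sketch runs without lxml installed
    import xml.etree.ElementTree as etree


def strip_last_p_tails(root):
    """Clear the tail of each <p> that is the last child of its parent."""
    for parent in root.iter():
        children = list(parent)
        # A <p> with no following sibling is simply the parent's last child
        if children and children[-1].tag == "p":
            children[-1].tail = None
    return root


# Hypothetical sample: tails after non-final elements must survive
XML = (
    "<div><p>one</p>keep me"
    "<div><p>inner</p>drop inner</div>mid"
    "<p>two</p>drop me</div>"
)

root = strip_last_p_tails(etree.fromstring(XML))
result = etree.tostring(root, encoding="unicode")
print(result)
```

Only the text after the final p in each parent is removed; "keep me" and "mid" stay because the elements they follow have a next sibling.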


 