
In lxml, how do I remove a tag but retain all contents?

The problem is this: I have an XML fragment like so:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>

I want to remove all <a> and <c> tags but keep their text content and child nodes exactly as they are. The <b> element should be left untouched. The result should look like this:

<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>

For the time being, I'm falling back on a very dirty trick: serialize the fragment with etree.tostring, remove the offending tags with string replacement, and replace the original fragment with the etree.fromstring result (not the real code, but it goes something like this):

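from lxml import etree

fragment = etree.fromstring("<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>")
# encoding="unicode" makes tostring return str; the default returns bytes on
# Python 3, where the str-based replace() calls below would raise TypeError
fstring = etree.tostring(fragment, encoding="unicode")
# Drop the opening and closing <a> and <c> tags, keeping everything between them
for tag in ("<a>", "</a>", "<c>", "</c>"):
    fstring = fstring.replace(tag, "")
fragment = etree.fromstring(fstring)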

I know I can probably use XSLT to achieve this, and I know lxml supports XSLT, but is there a more lxml-native approach?

For reference: I've tried getting there with lxml's element.replace, but since I want to insert text where there was an element node before, I don't think I can do that.

Try etree.strip_tags, which removes the named tags and merges their text and children into the parent: http://lxml.de/api/lxml.etree-module.html#strip_tags

>>> etree.strip_tags(fragment,'a','c')
>>> etree.tostring(fragment)
'<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'
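
On the worry that element.replace cannot put text where an element used to be: in the ElementTree model that lxml follows, text is not a separate node. It is stored on elements as .text (before the first child) and .tail (after the closing tag), so removing a tag means moving those strings onto a neighbour. The following is only a rough sketch of that idea, not lxml's actual implementation; unwrap_tags and _splice are hypothetical helper names, and the stdlib fallback import is an assumption added so the sketch runs without lxml installed. In real code, etree.strip_tags is the call to use.

```python
try:
    from lxml import etree  # the library this thread is about
except ImportError:
    # Assumption: fall back to the stdlib, which shares the .text/.tail model
    import xml.etree.ElementTree as etree


def _splice(parent, i):
    """Replace parent[i] with its own text and children (hypothetical helper)."""
    child = parent[i]
    grandchildren = list(child)
    text = child.text or ""
    tail = child.tail or ""
    if grandchildren:
        # Text after the removed tag now follows its last child
        last = grandchildren[-1]
        last.tail = (last.tail or "") + tail
    else:
        text += tail
    if i > 0:
        # Leading text is appended after the previous sibling
        prev = parent[i - 1]
        prev.tail = (prev.tail or "") + text
    else:
        # No previous sibling: it becomes part of the parent's own text
        parent.text = (parent.text or "") + text
    parent.remove(child)
    # Hoist the children into the position the removed tag occupied
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(i + offset, grandchild)


def unwrap_tags(parent, *tags):
    """Remove elements named in `tags` below `parent`, keeping their content."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag in tags:
            _splice(parent, i)  # do not advance: re-check whatever is now at i
        else:
            unwrap_tags(child, *tags)
            i += 1


fragment = etree.fromstring(
    "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)
unwrap_tags(fragment, "a", "c")
print(etree.tostring(fragment, encoding="unicode"))
# <fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>
```

Child elements of a removed tag are kept in place, so <r>x<a>p<d>1</d>q</a>y</r> with "a" unwrapped becomes <r>xp<d>1</d>qy</r>.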

Use lxml's Cleaner class to remove tags from HTML content. For an HTML document, Cleaner is a better general solution than strip_tags, because in cases like this you usually want to strip out more than just the tag: you also want to get rid of things such as onclick=function() attributes on other tags. Below is an example that removes <p> tags.

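# On lxml 5.2 and later this import needs the separate lxml_html_clean package installed
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.remove_tags = ['p']  # strip <p> tags but keep their content
# Illustrative input only; clean_html takes an HTML string and returns the cleaned string
cleaned = cleaner.clean_html('<div><p onclick="f()">Hello <b>world</b></p></div>')

The lxml documentation describes remove_tags as follows:

A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.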
