
Jsoup remove ONLY html tags

What is the proper way to remove ONLY HTML tags (preserving all custom/unknown tags) with jsoup (NOT regex)?

Input:

<html>
  <customTag>
    <div> dsgfdgdgf </div>
  </customTag>
  <123456789/>
  <123>
  <html123/>
</html>

Expected output:

  <customTag>
     dsgfdgdgf
  </customTag>
  <123456789/>
  <123>
  <html123/>

I tried using Cleaner with Whitelist.none(), but it also removes the custom tags.

I also tried:

String str = Jsoup.parse(html).text();

But that also removes the custom tags.

This answer doesn't work for me, because the number of possible custom tags is unlimited.

You might want to try something like this:

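import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Tags to strip; every unwanted HTML tag has to be listed explicitly.
String[] tags = new String[]{"html", "div"};
Document doc = Jsoup.parse("<html><customTag><div>dsgfdgdgf</div></customTag><123456789/><123><html123/></html>");
for (String tag : tags) {
    for (Element elem : doc.getElementsByTag(tag)) {
        // Move the element's children up into its parent, then drop the now-empty element.
        elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
        elem.remove();
    }
}
// Print what is left inside <body>.
System.out.println(doc.getElementsByTag("body").html());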

Please note that <123456789/> and <123> don't conform to the XML standard (a tag name cannot start with a digit), so the parser treats them as text and they get escaped. Another downside is that you have to explicitly list every tag you don't want (that is, all HTML tags), and it may be slow. I haven't measured how fast this runs.
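
The same idea can probably be written a bit more compactly with jsoup's Node.unwrap(), which removes a node and moves its children up into the parent. This is only a sketch: the class name StripKnownTags, the method stripKnownTags and the short tag list are made up for illustration, and the list is deliberately incomplete. It reads the result from doc.body(), so the html/head/body wrappers that jsoup adds are left out without having to touch them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Arrays;
import java.util.List;

public class StripKnownTags {
    // Hypothetical, incomplete list of HTML tags to strip; extend it as needed.
    // html, head and body are not listed: they are skipped by reading doc.body().
    static final List<String> HTML_TAGS = Arrays.asList("div", "span", "p");

    static String stripKnownTags(String markup) {
        Document doc = Jsoup.parse(markup);
        for (String tag : HTML_TAGS) {
            // getElementsByTag returns a separate list, so changing the DOM
            // while looping over it is safe.
            for (Element elem : doc.body().getElementsByTag(tag)) {
                // Replace the element with its own children.
                elem.unwrap();
            }
        }
        // Inner HTML of <body>: unknown tags stay, listed tags are gone.
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(stripKnownTags(
            "<html><customTag><div> dsgfdgdgf </div></customTag><html123/></html>"));
    }
}
```

Keep in mind that the HTML parser lower-cases tag names, so customTag comes back as customtag, and the caveat about tags like <123> still applies.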

MFG MiSt
