
How to remove all elements on text level with Jsoup?

I'm working on a project where I'm only interested in the page layout, not in the text. I'm currently having trouble getting rid of every element at text level. For example:

<div>
    <ul>
        <li>some menu item</li>
        <li>some menu item</li>
        <li>some menu item</li>
    </ul>
</div>
<div>
    <h3>Tile of some text</h3>
    <p></p>
    <p>some text</p>
    <ul>
        <li>some other text</li>
        <li>some other text</li>
        <li>some other text</li>
    </ul>
</div>

I want to get rid of the ul, li, p and h3 elements on text level but keep the div and the list with menu items as this is part of the layout of the page. How do I do this with Jsoup?

I've been trying to do this with document.select() followed by .remove() on the selected elements, but the select function is not made for such non-standard queries.

EDIT: The end result I want to get is:

<div>
    <ul>
        <li>some menu item</li>
        <li>some menu item</li>
        <li>some menu item</li>
    </ul>
</div>
<div>

</div>

As you can see, the list is removed when the ul tag is on the same level as tags with text in them. That ul tag is part of the text on the page and has nothing to do with the layout of the page. The ul tag with menu items is important for the page, as it shows that there is a menu there with 3 different items.

You can select and remove all p, li and ul elements with standard selectors:

doc.select("p").remove();
doc.select("ul").remove();
doc.select("li").remove();

I first found the tags that I want to get rid of, and then called empty() on their parent.

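    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class RemoveTextLevel {
        public static void main(String[] args) {
            String html = "<div> <ul>  <li>some menu item</li>  <li>some menu item</li>  <li>some menu item</li> </ul></div><div> <h3>Tile of some text</h3> <p></p> <p>some text</p> <ul>  <li>some other text</li>  <li>some other text</li>  <li>some other text</li> </ul></div>";
            Document doc = Jsoup.parse(html);
            // select("*") returns a snapshot list, so emptying parents inside the loop is safe
            Elements elements = doc.body().select("*");
            for (Element element : elements) {
                String tag = element.tagName();
                // An h3 or p marks its parent as a container of text
                if ("h3".equals(tag) || "p".equals(tag)) {
                    Element parent = element.parent();
                    if (parent != null) parent.empty(); // drops every child of that parent
                }
            }
            System.out.println(doc);
        }
    }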
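Calling empty() on the parent removes everything inside it, including any nested layout elements such as another div. A minimal sketch of a narrower variant, assuming that headings and p elements mark a container as holding text and that only headings, paragraphs and lists should be dropped from it (the class name TextLevelCleaner, the method stripTextLevel and both tag lists are made up for illustration, and Set.of needs Java 9 or later):

```java
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextLevelCleaner {
    // Assumption: elements that mark their parent as a text container.
    private static final String MARKERS = "h1, h2, h3, h4, h5, h6, p";
    // Assumption: tags to drop once a parent is known to hold text.
    private static final Set<String> TEXT_LEVEL =
            Set.of("h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol");

    // Hypothetical helper: returns the body HTML with text-level children removed.
    static String stripTextLevel(String html) {
        Document doc = Jsoup.parse(html);
        // select() returns a snapshot list, so removing nodes while looping is safe.
        for (Element marker : doc.select(MARKERS)) {
            Element parent = marker.parent();
            if (parent == null) {
                continue; // already removed together with an earlier sibling
            }
            // children() returns a copy of the child list, so removal is safe here too.
            for (Element child : parent.children()) {
                if (TEXT_LEVEL.contains(child.tagName())) {
                    child.remove(); // removing a ul also removes its li children
                }
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div><ul><li>some menu item</li></ul></div>"
                + "<div><h3>Tile of some text</h3><p>some text</p>"
                + "<ul><li>some other text</li></ul></div>";
        System.out.println(stripTextLevel(html));
    }
}
```

The menu list survives because its parent div has no heading or p as a direct child, while the second div is left in place but loses its heading, paragraphs and list.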
