
Using JSoup to parse text between two different tags

I have the following HTML...

<h3 class="number">
<span class="navigation">
6:55 <a href="/results/result.html" class="under"><b>&raquo;</b></a>
</span>**This is the text I need to parse!**</h3>

I can use the following code to extract the text from the h3 tag.

Element h3 = doc.select("h3").get(0);

Unfortunately, that gives me everything in that tag.

6:55 &raquo; This is the text I need to parse!

Can I use Jsoup to parse between different tags? Is there a best practice for doing this (regex?)

(regex?)

No. As you can read in the answers to this question, you can't parse HTML using a regular expression.

Try this:

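Element h3 = doc.select("h3").get(0);
// Full text of the h3, including the text of the nested span
String h3Text = h3.text();
// Text of the span alone
String spanText = h3.select("span").get(0).text();
// Remove the span's text and trim the whitespace it leaves behind
String textBetweenSpanEndAndH3End = h3Text.replace(spanText, "").trim();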
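One caveat with this approach: String.replace removes every occurrence of the span text, not only the leading one. The sketch below is one possible way around that, using plain strings only. The class name PrefixStrip, the helper stripLeadingPrefix, and the sample text in main are invented for illustration and are not part of JSoup.

```java
class PrefixStrip {
    // Hypothetical helper: removes prefix only when full starts with it,
    // then trims surrounding whitespace. Otherwise returns full, trimmed.
    static String stripLeadingPrefix(String full, String prefix) {
        if (full.startsWith(prefix)) {
            return full.substring(prefix.length()).trim();
        }
        return full.trim();
    }

    public static void main(String[] args) {
        // Invented sample in which the span text also appears later in the heading.
        String h3Text = "6:55 \u00BB Results from 6:55 \u00BB onwards";
        String spanText = "6:55 \u00BB";

        // replace() drops both occurrences of the span text.
        System.out.println("[" + h3Text.replace(spanText, "") + "]");
        // The helper drops only the leading occurrence.
        System.out.println("[" + stripLeadingPrefix(h3Text, spanText) + "]");
    }
}
```

In the snippet above, h3Text and spanText would come from h3.text() and the span's text(), so this only helps when the span is the first thing inside the h3.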

No, JSoup wasn't made for this. It's supposed to parse something hierarchical. Searching for text that sits between an end tag and a start tag, or the other way around, wouldn't make any sense for JSoup. That's what regular expressions are for.

But you should of course narrow it down as much as you can using JSoup first, before you fire a regex at the string.
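A minimal sketch of that idea, under some assumptions: the input string is taken to be the inner HTML of the h3 (for example, what h3.html() would return once JSoup has narrowed things down), and the class name SpanTailRegex and method textAfterSpan are hypothetical names chosen for this example. Only java.util.regex is used here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanTailRegex {
    // Captures everything after the first closing </span> tag, up to the end of input.
    private static final Pattern AFTER_SPAN =
            Pattern.compile("</span>(.*)\\z", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the trimmed text after </span>,
    // or an empty string when the fragment has no closing span tag.
    static String textAfterSpan(String h3InnerHtml) {
        Matcher m = AFTER_SPAN.matcher(h3InnerHtml);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        // Assumed to be the already narrowed-down content of the h3 element.
        String inner = "<span class=\"navigation\">\n"
                + "6:55 <a href=\"/results/result.html\" class=\"under\"><b>&raquo;</b></a>\n"
                + "</span>This is the text I need to parse!";
        System.out.println(textAfterSpan(inner)); // prints "This is the text I need to parse!"
    }
}
```

Note that the captured remainder is still raw HTML, so any entities or nested tags after the span are returned as-is rather than decoded.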

Just use ownText()

    @Test
    void innerTextCase() {
        String sample = "<h3 class=\"number\">\n" +
                "<span class=\"navigation\">\n" +
                "6:55 <a href=\"/results/result.html\" class=\"under\"><b>&raquo;</b></a>\n" +
                "</span>**This is the text I need to parse!**</h3>\n";
        Assertions.assertEquals("**This is the text I need to parse!**",
                Jsoup.parse(sample).select("h3").first().ownText());
    }
