jsoup: parse the content of the tags that come right after a particular tag

I have been trying for the last three days to parse certain information with jsoup in Java. This is my code:

Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");

for (Element link : links) {
    // String name = link.text();
    String title = link.select("h2").text();
    String content = link.select("p").text();
    System.out.println(title);
    System.out.println(content);
}

It does what the code says: it fetches the text of all the <h2> tags and of all the <p> tags separately. The problem is that I want the text of the <p> tags that come right after each <h2> tag, grouped per heading.

For example (HTML content):

<h2>main content</h2>
<div class="acx"></div>
<p>content</p>
<p>content 2</p>

<h2>content 2</h2>
<div class="acx"></div>
<p>new content od 2</p>
<p>new 2</p>

The result should be an array like this:

array[0] = "content content 2"
array[1] = "new content od 2 new 2"

Any solutions?

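You can experiment with the "~" sibling combinator. "h2 ~ p" matches every <p> that follows an <h2> under the same parent, and the matches come back in document order. For example:

link.select("h2 ~ p").get(0).text(); // returns "content"
link.select("h2 ~ p").get(1).text(); // returns "content 2", the next matching <p> in document order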
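As a rough check of what the sibling combinators select, the sketch below parses the question's sample markup from a string instead of fetching a page. It assumes that each <div class="acx"> is closed before the paragraphs, so that the <p> tags are siblings of the <h2> tags. The class name SiblingSelectorDemo and the helper texts are made up for illustration.

```java
import org.jsoup.Jsoup;

import java.util.List;

public class SiblingSelectorDemo {

    // Hypothetical helper: parse the HTML and return the text of every match.
    public static List<String> texts(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        // Sample markup from the question, with the acx divs assumed to be closed.
        String html =
              "<div class=\"contentBox\">"
            + "<h2>main content</h2><div class=\"acx\"></div>"
            + "<p>content</p><p>content 2</p>"
            + "<h2>content 2</h2><div class=\"acx\"></div>"
            + "<p>new content od 2</p><p>new 2</p>"
            + "</div>";

        // General sibling combinator: every <p> preceded by a sibling <h2>.
        System.out.println(texts(html, "h2 ~ p"));
        // [content, content 2, new content od 2, new 2]

        // Adjacent sibling combinators: only the <p> right after h2 + div.
        System.out.println(texts(html, "h2 + div + p"));
        // [content, new content od 2]
    }
}
```

Neither selector groups the paragraphs per heading on its own: "h2 ~ p" returns one flat list, and "h2 + div + p" returns only the first paragraph after each heading.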

Alternatively, keep your initial approach and iterate over the tags you need within each selected .contentBox element:

Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");

for (Element link : links) {
    // Print the text of every <h2> in this box
    for (Element h2Tag : link.select("h2")) {
        System.out.println(h2Tag.text());
    }
    // Print the text of every <p> in this box
    for (Element pTag : link.select("p")) {
        System.out.println(pTag.text());
    }
}
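To get one entry per heading, as the question asks, one possible approach is to walk the siblings after each <h2> and collect the <p> text until the next <h2>. This is only a sketch: it assumes the <p> tags are siblings of the <h2> tags (that is, each <div class="acx"> is closed), it parses a hard-coded string rather than fetching a page, and the names ParagraphsByHeading and paragraphsByHeading are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphsByHeading {

    // Returns one string per <h2> inside .contentBox: the text of the <p>
    // siblings that follow it, joined with spaces, up to the next <h2>.
    public static List<String> paragraphsByHeading(String html) {
        Document document = Jsoup.parse(html);
        List<String> groups = new ArrayList<>();

        for (Element h2 : document.select(".contentBox h2")) {
            StringBuilder text = new StringBuilder();
            // Walk forward through the siblings until the next heading or the end.
            for (Element sibling = h2.nextElementSibling();
                 sibling != null && !sibling.tagName().equals("h2");
                 sibling = sibling.nextElementSibling()) {
                if (sibling.tagName().equals("p")) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(sibling.text());
                }
            }
            groups.add(text.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample markup from the question, with the acx divs assumed to be closed.
        String html =
              "<div class=\"contentBox\">"
            + "<h2>main content</h2><div class=\"acx\"></div>"
            + "<p>content</p><p>content 2</p>"
            + "<h2>content 2</h2><div class=\"acx\"></div>"
            + "<p>new content od 2</p><p>new 2</p>"
            + "</div>";

        System.out.println(paragraphsByHeading(html));
        // [content content 2, new content od 2 new 2]
    }
}
```

If the paragraphs were nested inside the div rather than next to it, the sibling walk would find nothing, and the loop would have to descend into the div instead.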
