How to get children of class in jsoup

I want to scrape comments from a website, but I am having trouble getting the p tag inside an element with a given class in jsoup. Example HTML is below:

<html>
 <head>
  <title>My webpage</title>
 </head>
 <body>
  <div class="container">
     <div class="comment">
      <p>This is comment</p>
     </div>
  </div>
 </body> 
</html> 

Here is my Java code:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// The class declaration was cut off in the original post; the name Main is assumed.
public class Main {
public static void main(String args[]){
    Document doc = null;
    try {

        doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").get();
        System.out.println("Connect successfully");
        org.jsoup.select.Elements element =  doc.select("div.post-message");

        System.out.println(element.get(0).text());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}
}

The comments section of the page you are trying to fetch is not plain HTML content. The comments are added to the DOM by JavaScript after the initial page load. jsoup is an HTML parser and does not execute JavaScript, so it cannot fetch the comments on this page. To fetch this kind of content you need an embedded browser component. Take a look at this answer: Is there a way to embed a browser in Java?
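
One way to see this from the jsoup side is to check whether the selector matched anything before calling get(0), which throws on an empty result. The following is a minimal sketch, not the page's real markup: the class name CommentProbe, the helper firstTextOrNull, and the sample HTML string are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class; the name is not from the original post.
public class CommentProbe {

    // Returns the text of the first element matching cssQuery,
    // or null when the parsed HTML contains no match.
    public static String firstTextOrNull(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        if (matches.isEmpty()) {
            return null; // avoids the IndexOutOfBoundsException from get(0)
        }
        return matches.get(0).text();
    }

    public static void main(String[] args) {
        // Made-up stand-in for server-sent HTML whose comments are filled in later by a script.
        String initialHtml = "<html><body><div id=\"thread\"></div>"
                + "<script>/* comments are injected here at runtime */</script></body></html>";

        String text = firstTextOrNull(initialHtml, "div.post-message");
        System.out.println(text == null
                ? "No div.post-message in the fetched HTML"
                : text);
    }
}
```

If the selector comes back empty for HTML fetched with Jsoup.connect(...).get(), that is consistent with the content being inserted by a script after the page loads.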

The code below handles the specific HTML string you provided.

Try this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] arg) {
        Document doc = null;
        try {
            doc = Jsoup.parse("<html> "
                    + "<head>  "
                    + "<title>My webpage</title> "
                    + "</head> <body>  <div class=\"container\">     "
                    + "<div class=\"comment\">      "
                    + "<p>This is comment</p>    "
                    + " </div>  </div> </body></html> ");

            Elements element = doc.select(".container").select(".comment");
            System.out.println(element.get(0).select("p").text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To connect to the URL, use:

doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").timeout(60*1000).userAgent("Mozilla").get();

To extend Arijit's solution, if there are multiple <div> tags with a comment class, you could try:

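import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Wrapper class added so the snippet compiles; the name is arbitrary.
public class MultipleComments {

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.parse("<html> " + "<head>  " + "<title>My webpage</title> "
                    + "</head> <body>  <div class=\"container\">     " + "<div class=\"comment foo\">      "
                    + "<p>This is comment</p>    " + " </div>  </div> </body></html> ");

            // The second argument is a regex matched against the class attribute value,
            // so "comment foo" is matched here.
            Elements comments = doc.getElementsByAttributeValueMatching("class", "comment");
            Iterator<Element> iter = comments.iterator();
            while (iter.hasNext()) {
                Element e = iter.next();
                System.out.println(e.getElementsByTag("p").text());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}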

If other tags share the comment class, you can use e.tagName() to check that the element is a <div>.
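
The tagName() check could look roughly like the sketch below. It swaps in getElementsByClass("comment"), which matches whole class names, in place of the regex lookup above; that substitution, the class name DivCommentFilter, the method divCommentTexts, and the sample HTML are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of the original answers.
public class DivCommentFilter {

    // Collects the <p> text of every <div> that carries the class "comment".
    public static List<String> divCommentTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element e : doc.getElementsByClass("comment")) {
            // Skip non-div elements that happen to share the class.
            if (e.tagName().equals("div")) {
                texts.add(e.getElementsByTag("p").text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up markup: one matching div, one section sharing the class.
        String html = "<div class=\"container\">"
                + "<div class=\"comment foo\"><p>This is comment</p></div>"
                + "<section class=\"comment\"><p>Not a div</p></section>"
                + "</div>";
        System.out.println(divCommentTexts(html));
    }
}
```

The CSS selector div.comment expresses the same tag-plus-class filter in a single select call, if you prefer selectors over the getElementsBy methods.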

If your goal is to print out "This is comment", you could try something like this:

org.jsoup.select.Elements element = doc.select("div.container").select("div.comment");
System.out.println(element.get(0).text());
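
If you want only the p elements that are direct children of the comment div, as the question title suggests, the child combinator (>) in a jsoup selector is one option. This is a sketch under that assumption; the class name CommentParagraphs and the method directParagraphs are hypothetical, and the HTML is the sample from the question.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of the original answers.
public class CommentParagraphs {

    // Returns the text of each <p> that is a direct child of a div.comment
    // located somewhere inside a div.container.
    public static List<String> directParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("div.container div.comment > p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My webpage</title></head><body>"
                + "<div class=\"container\"><div class=\"comment\">"
                + "<p>This is comment</p>"
                + "</div></div></body></html>";
        System.out.println(directParagraphs(html)); // prints [This is comment]
    }
}
```

The space in the selector means "any descendant", while > restricts the match to immediate children, so a p nested deeper inside the comment (for example inside a blockquote) is not returned.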

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this content, please cite the site URL or the original address.