How to get the children of a class in Jsoup

Question

I want to scrape comments from a website. I am having trouble getting the p tag inside a class in Jsoup. Example HTML code is below:

<html>
 <head>
  <title>My webpage</title>
 </head>
 <body>
  <div class="container">
     <div class="comment">
      <p>This is comment</p>
     </div>
  </div>
 </body> 
</html>

Here is my Java code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Class name assumed; the original snippet omitted the class declaration.
public class Main {

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").get();
            System.out.println("Connect successfully");
            Elements element = doc.select("div.post-message");

            System.out.println(element.get(0).text());
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Answer 1

The comments section of the page you are trying to fetch is not plain HTML content. The comments are loaded into the DOM by JavaScript after the initial page load. Jsoup is an HTML parser and does not execute JavaScript, so you cannot fetch the page's comments with Jsoup. To fetch this kind of content you need an embedded browser component. Take a look at this answer: Is there a way to embed a browser in Java?

The code below is for the specific HTML string you provided.

Try this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) {
        Document doc = null;
        try {
            // Parse the HTML string from the question.
            doc = Jsoup.parse("<html> "
                    + "<head>  "
                    + "<title>My webpage</title> "
                    + "</head> <body>  <div class=\"container\">     "
                    + "<div class=\"comment\">      "
                    + "<p>This is comment</p>    "
                    + " </div>  </div> </body></html> ");

            // Select the .comment elements inside .container, then read the <p> text.
            Elements element = doc.select(".container").select(".comment");
            System.out.println(element.get(0).select("p").text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To connect to the URL, use:

doc = Jsoup.connect("https://homeshopping.pk/products/Amazon-Fire-Phone-%284G%2C-32GB%2C-Black%29-Price-in-Pakistan.html").timeout(60*1000).userAgent("Mozilla").get();
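
Because the comments are injected by JavaScript, the selector in the question would likely match nothing in the HTML that Jsoup receives, and calling get(0) on an empty selection throws an IndexOutOfBoundsException. The following is a minimal sketch of a guard against that; it does not load the comments. The class name EmptySelectionCheck, the helper firstTextOrNull, and the sample HTML are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class EmptySelectionCheck {

    // Returns the text of the first element matching cssQuery,
    // or null when the selector matches nothing in the parsed HTML.
    static String firstTextOrNull(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        return matches.isEmpty() ? null : matches.get(0).text();
    }

    public static void main(String[] args) {
        // Hypothetical initial HTML: the comment container exists, but the
        // comments themselves would only be added later by a script.
        String html = "<html><body><div id=\"comments\"></div>"
                + "<script>/* comments are injected here at runtime */</script>"
                + "</body></html>";

        String text = firstTextOrNull(html, "div.post-message");
        System.out.println(text == null ? "No match in the static HTML" : text);
    }
}
```

With the sample HTML above, the selector finds no element, so the program prints "No match in the static HTML" instead of failing.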

Answer 2

To extend Arijit's solution, if there are multiple <div> tags with a comment class, you could try:

// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element and org.jsoup.select.Elements.
Document doc = null;
try {
    doc = Jsoup.parse("<html> " + "<head>  " + "<title>My webpage</title> "
            + "</head> <body>  <div class=\"container\">     " + "<div class=\"comment foo\">      "
            + "<p>This is comment</p>    " + " </div>  </div> </body></html> ");

    // Match every element whose class attribute matches the regex "comment".
    Elements comments = doc.getElementsByAttributeValueMatching("class", "comment");
    for (Element e : comments) {
        // Print the text of the <p> tags inside each matched element.
        System.out.println(e.getElementsByTag("p").text());
    }
} catch (Exception ex) {
    ex.printStackTrace();
}

If there are other tags that share the comment class, you can use e.tagName() to check that it is a <div>.
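
The tagName() check could look roughly like the sketch below. It uses getElementsByClass rather than the regex-based lookup, and the class name DivCommentFilter, the helper divCommentTexts, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DivCommentFilter {

    // Collects the <p> text of every element that has the class "comment"
    // and is a <div>; other tags sharing the class are skipped.
    static List<String> divCommentTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element e : doc.getElementsByClass("comment")) {
            if (e.tagName().equals("div")) {
                texts.add(e.getElementsByTag("p").text());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up HTML: two <div> comments and one <span> sharing the class.
        String html = "<div class=\"container\">"
                + "<div class=\"comment\"><p>First comment</p></div>"
                + "<span class=\"comment\">Not a div</span>"
                + "<div class=\"comment foo\"><p>Second comment</p></div>"
                + "</div>";
        System.out.println(divCommentTexts(html)); // prints [First comment, Second comment]
    }
}
```

The same filtering can usually be expressed in the selector itself, for example doc.select("div.comment"), which matches only <div> elements carrying that class.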

Answer 3

If your goal is to print out "This is comment", you could try something like this:

org.jsoup.select.Elements element = doc.select("div.container").select("div.comment");
System.out.println(element.get(0).text());
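
As a variation on the chained select calls, the whole path can be written as a single selector. The sketch below uses the child combinator (>), which is stricter than the chained form because it only matches direct children. The class name SingleSelector and the helper commentParagraphs are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SingleSelector {

    // Returns the text of each <p> that is a direct child of a div.comment,
    // which in turn is a direct child of a div.container.
    static List<String> commentParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element p : doc.select("div.container > div.comment > p")) {
            result.add(p.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // The HTML from the question.
        String html = "<html><head><title>My webpage</title></head><body>"
                + "<div class=\"container\">"
                + "<div class=\"comment\"><p>This is comment</p></div>"
                + "</div></body></html>";
        System.out.println(commentParagraphs(html)); // prints [This is comment]
    }
}
```

Replacing each " > " with a space gives the descendant form, which also matches when other elements sit between the container, the comment, and the paragraph.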

3 answers

Answer 1
2 2016-09-12 19:44:49

Answer 2
1 2016-09-12 19:51:48

Answer 3
0 2016-09-12 19:43:46
