JSOUP Java Web Scraping / Parsing

I want to write a program that takes one link and extracts certain features from that page (e.g. download count, like count, etc.). I can extract these fine because they are just headers. What I do not understand is how to extract the title of a link nested inside that page. As an example, if I entered google.com, I would want to extract the title "Show X results found", which is itself a link, but X is not static: the link's title changes depending on the number of results (in my case, runs).

To explain a bit better, here is my code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws Exception {
        String url = "https://www.openml.org/t/31";
        Document document = Jsoup.connect(url).get();

        // String question = document.select("#question .post-text").text();
        // System.out.println("Question: " + question);

        Elements title = document.select("div#subtitle");
        System.out.println("Title:  " + title.text());

        Elements downloadcount = document.select("span#downloadcount");
        System.out.println(downloadcount.text());

        Elements likecount = document.select("span#likecount");
        System.out.println(likecount.text());

        Elements nr_of_issues = document.select("span#nr_of_issues");
        System.out.println(nr_of_issues.text());

        // Hard-coded copy of the HTML around the run count (this is the part that goes stale)
        String runs = "<i class=\"fa fa-star\"></i> <a href=\"#taskruns\" data-toggle=\"tab\">396900 runs submitted</a>";
        Document number = Jsoup.parse(runs);

        // Read the text of the <a> element itself rather than of the whole parsed document
        Element link = number.select("a").first();
        String linkText = link.text();
        System.out.println(linkText);
    }
}

The title, downloadcount, likecount, and nr_of_issues work fine because they aren't links. Only "runs" is not working. I cannot hard-code the runs String as that HTML because the value is always changing (right now it is at 396900, but what if it changes to 400000 tomorrow?).

Building on my comment on the question, you can see that the text we want to reference is not static, but there is an element above it that is static, and it has id="detail".

[Image: location of the element in the site's HTML]

We need to reference the parent element and then get the child from it, assuming that the target always remains a child of the first child div element (hopefully Inception wasn't a confusing movie).

Here's how we can do it in Java:

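import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws Exception {
        String url = "https://www.openml.org/t/31";
        Document doc = Jsoup.connect(url).get();

        // Anchor on the static parent element, which has a stable id
        Element parentElement = doc.select("div#detail").first();
        // child(1) uses a 0-based index, so this is the second child element of div#detail
        Elements h2Element = parentElement.child(1).select("h2");
        System.out.println(h2Element.text());
    }
}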

Running the above Java will print:

396928 Runs
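
If the count is needed as a number rather than as text, the printed string can be parsed afterwards. The following is only a rough sketch and not part of the original answer: RunCountParser and parseCount are hypothetical names, it uses only the Java standard library, and it assumes the count is the first run of digits in the element text (optionally with comma thousands separators).

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup or of the original answer.
class RunCountParser {
    // A digit followed by any mix of digits and commas, e.g. "396928" or "396,928".
    private static final Pattern DIGITS = Pattern.compile("\\d[\\d,]*");

    // Returns the first number found in the text, or empty if the text has no digits.
    static OptionalLong parseCount(String text) {
        Matcher m = DIGITS.matcher(text);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Drop thousands separators before converting to a long.
        return OptionalLong.of(Long.parseLong(m.group().replace(",", "")));
    }

    public static void main(String[] args) {
        // Sample strings taken from the question and the answer's output.
        System.out.println(parseCount("396928 Runs").getAsLong());           // 396928
        System.out.println(parseCount("396900 runs submitted").getAsLong()); // 396900
        System.out.println(parseCount("no runs yet").isPresent());           // false
    }
}
```

Returning an OptionalLong keeps the "no number in the text" case explicit, which matters here because the page layout may change and the selected element could come back empty.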
