
JSOUP Java Web Scraping / Parsing

I want to write a program that takes one link and extracts certain values from that page (e.g. download count, like count, etc.). I can extract these fine because they are plain elements with fixed ids. What I do not understand is how to extract the text of a link that sits inside the page. As an analogy, if I entered google.com I would want to extract the text "Show X amount of results found", which is itself a link, but X is not static: the link text changes with the number of results (in my case, runs).

To explain a bit better, here is my code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Class name assumed; the original post did not include the class declaration.
public class Scraper {
    public static void main(String[] args) throws Exception {
        String url = "https://www.openml.org/t/31";
        Document document = Jsoup.connect(url).get();

        // String question = document.select("#question .post-text").text();
        // System.out.println("Question: " + question);

        // These lookups work: each value lives in an element with a fixed id.
        Elements title = document.select("div#subtitle");
        System.out.println("Title:  " + title.text());

        Elements downloadcount = document.select("span#downloadcount");
        System.out.println(downloadcount.text());

        Elements likecount = document.select("span#likecount");
        System.out.println(likecount.text());

        Elements nr_of_issues = document.select("span#nr_of_issues");
        System.out.println(nr_of_issues.text());

        // Problem: this markup is hard-coded, so the run count never updates.
        String runs = "<i class=\"fa fa-star\"></i> <a href=\"#taskruns\" data-toggle=\"tab\">396900 runs submitted</a>";
        Document number = Jsoup.parse(runs);

        // Read the text of the link itself rather than of the whole document.
        Element link = number.select("a").first();
        String linkText = link.text();
        System.out.println(linkText);
    }
}

The title, downloadcount, likecount, and nr_of_issues lookups work fine because they aren't links. Only "runs" is not working. I cannot hard-code the runs string as that HTML because it is always changing (right now it is at 396900, but what if it changes to 400000 tomorrow?).

Building off of my comment on the OP: the text we want to reference is not static, but it sits under an element that is, namely the one with id="detail".

[Image: position of the element in the site's HTML]

We need to reference the parent element and then get the child from it, assuming that the target always stays a child of the first child div element (hopefully Inception wasn't a confusing movie).
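To make the parent-then-child lookup concrete without hitting the network, here is a minimal sketch that runs jsoup against a small inline fragment. The markup in SAMPLE_HTML is a hypothetical, simplified stand-in for the real page, and the class name ParentChildSketch and the helper runsHeading are made up for illustration. It also adds the null and bounds checks you would probably want if the page layout ever changes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParentChildSketch {

    // Hypothetical, simplified markup; the real page is larger and may differ.
    static final String SAMPLE_HTML =
          "<div id=\"detail\">"
        + "<ul class=\"nav\"><li>Overview</li></ul>"
        + "<div class=\"tab-content\"><h2>396928 Runs</h2></div>"
        + "</div>";

    // Returns the text of the first <h2> inside the child element at
    // childIndex (zero-based) of div#detail, or null if anything is missing.
    static String runsHeading(Document doc, int childIndex) {
        Element parent = doc.selectFirst("div#detail");
        if (parent == null || childIndex < 0
                || childIndex >= parent.children().size()) {
            return null;
        }
        Element heading = parent.child(childIndex).selectFirst("h2");
        return heading == null ? null : heading.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(runsHeading(doc, 1));
    }
}
```

Because nothing in the selector mentions the number itself, the same lookup keeps working when the count changes.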

Here's how we can do it in Java against the live page:

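import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Class name assumed; the original answer showed only the method.
public class RunsScraper {
    public static void main(String[] args) throws Exception {
        String url = "https://www.openml.org/t/31";
        Document doc = Jsoup.connect(url).get();

        // The element with id="detail" is the stable anchor.
        Element parentElement = doc.select("div#detail").first();
        // child(1) is the child element at zero-based index 1.
        Elements h2Element = parentElement.child(1).select("h2");
        System.out.println(h2Element.text());
    }
}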

Running the above Java will print:

396928 Runs
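
If you need the count as a number rather than as text like "396928 Runs", one option is to pull the first run of digits out of the string. This is only a sketch using the standard library: the class name RunCountParser and the method parseCount are hypothetical, and the handling of comma thousands separators is an assumption, since the page output shown above contains none.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunCountParser {

    // First run of digits, optionally with comma separators (assumed format).
    private static final Pattern COUNT = Pattern.compile("\\d[\\d,]*");

    // Returns the first number found in the text, or empty if there is none.
    static OptionalLong parseCount(String text) {
        if (text == null) {
            return OptionalLong.empty();
        }
        Matcher m = COUNT.matcher(text);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Strip separators before converting, e.g. "396,928" -> "396928".
        return OptionalLong.of(Long.parseLong(m.group().replace(",", "")));
    }

    public static void main(String[] args) {
        System.out.println(parseCount("396928 Runs"));
        System.out.println(parseCount("396900 runs submitted"));
        System.out.println(parseCount("no digits here"));
    }
}
```

Returning an OptionalLong instead of throwing keeps the caller in charge of what to do when the heading is missing or has changed format.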


 