JSOUP Java Web Scraping / Parsing

Question

I want to write a program where I submit one link and extract certain features from that page (e.g. download count, like count, etc.). I can extract these fine because they are just headers. What I do not understand is how to extract the text of a link inside that page. As an example, if I put in google.com, I want to extract the text "Show X amount of results found", which is itself a link, but X is not static: the link text changes with the number of results (in my case, runs).

To explain a bit better, here is my code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; the original snippet did not include one.
public class Scraper {

    public static void main(String[] args) throws Exception {
        String url = "https://www.openml.org/t/31";
        Document document = Jsoup.connect(url).get();

        // String question = document.select("#question .post-text").text();
        // System.out.println("Question: " + question);

        // These work: each value sits in an element with a fixed id.
        Elements title = document.select("div#subtitle");
        System.out.println("Title:  " + title.text());

        Elements downloadcount = document.select("span#downloadcount");
        System.out.println(downloadcount.text());

        Elements likecount = document.select("span#likecount");
        System.out.println(likecount.text());

        Elements nr_of_issues = document.select("span#nr_of_issues");
        System.out.println(nr_of_issues.text());

        // Problem: this HTML is hard-coded (copied from the page source), not read from the live page.
        String runs = "<i class=\"fa fa-star\"></i> <a href=\"#taskruns\" data-toggle=\"tab\">396900 runs submitted</a>";
        Document number = Jsoup.parse(runs);

        // Read the text of the link itself rather than of the whole parsed fragment.
        Element link = number.select("a").first();
        String linkText = link.text();
        System.out.println(linkText);
    }
}

The title, downloadcount, likecount, and nr_of_issues work fine because they aren't links. Only the "runs" part is not working. I cannot hard-code the String runs with that HTML because the value is always changing (right now it is at 396900, but what if tomorrow it changes to 400000?).

Answer 1

Building on my comment on the question: the text we want is not static, but it sits inside an element that is, and that element has id="detail".

So we select that parent element and then walk down to the child from it, assuming the text always stays inside the same child div of that parent (hopefully Inception wasn't a confusing movie).

Here's how we can do it in Java:

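import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RunsScraper {
    public static void main(String[] args) throws Exception {
        String url = "https://www.openml.org/t/31";
        Document doc = Jsoup.connect(url).get();

        // div#detail is the static parent; child(1) is its second child (the index is zero-based).
        Element parentElement = doc.select("div#detail").first();
        Elements h2Element = parentElement.child(1).select("h2");
        System.out.println(h2Element.text());
    }
}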

Running the above Java will print:

396928 Runs
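
If only the number is needed rather than the whole label, the printed text can be reduced to an integer with the standard library. This is a minimal sketch, not part of the original answer: the class name RunCount and the method parseRunCount are hypothetical, and it assumes the text contains a single unformatted run of digits, as in "396928 Runs" or "396900 runs submitted".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the run count out of text such as "396928 Runs".
class RunCount {
    // Matches the first run of consecutive digits in the text.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    static long parseRunCount(String text) {
        Matcher m = DIGITS.matcher(text);
        if (!m.find()) {
            // Fail loudly if the page layout changed and no number is present.
            throw new IllegalArgumentException("No number found in: " + text);
        }
        return Long.parseLong(m.group());
    }

    public static void main(String[] args) {
        System.out.println(parseRunCount("396928 Runs"));           // prints 396928
        System.out.println(parseRunCount("396900 runs submitted")); // prints 396900
    }
}
```

In the scraper above, the argument would be h2Element.text() instead of a literal string. If the site ever formats the number with separators (for example "396,928"), this pattern would only capture the digits before the first separator, so the pattern would need adjusting.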

1 answer

solution1
0 ACCEPTED 2018-04-08 23:43:00
