
Why does JSoup not read all the elements of the page?

Today I started "to play" with JSoup. I wanted to know how powerful JSoup is, so I looked for a webpage with a lot of elements and tried to retrieve all of them. I found what I was looking for: http://www.top1000.ie/companies .

This page contains a long list (1000 entries) of similar elements, one per company; only the text inside each element changes. That text is what I tried to retrieve, but I am only able to get the first 20 elements of the list, not the rest.

This is my simple code:

package retrieveInfo;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Retrieve {

    public static void main(String[] args) throws Exception{
        String url = "http://www.top1000.ie/companies";
        Document document = Jsoup.connect(url)
                 .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                 .timeout(1000*5)
                 .get();

        Elements companies = document.body().select(".content .name");
        for (Element company : companies) {
            System.out.println("Company: " + company.text());
        }
    }

}

I thought the page might not have had time to load, which is why I added .timeout(1000*5) to wait 5 seconds, but I still only get the first 20 elements of the list.

Does JSoup have a limit on the number of elements you can retrieve from a webpage? I don't think it should, since it seems designed for exactly this purpose, so I suspect I am missing something in my code.

Any help would be appreciated. Thanks in advance!

NEW ANSWER:

I looked at the website you are trying to parse. The problem is that only the first 20 companies are loaded with the first call to the site; the rest are loaded via AJAX, and Jsoup does not interpret or run JavaScript. You can use Selenium WebDriver for that, or figure out the AJAX calls directly.
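If you go the Selenium route, a minimal sketch might look like the following. This is an assumption-laden example: it assumes a local chromedriver is available, that the same .content .name selector matches the rendered page, and the scroll count and sleep durations are rough guesses, not measured values.

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class RetrieveWithSelenium {

    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new ChromeDriver(); // needs chromedriver on the PATH
        try {
            driver.get("http://www.top1000.ie/companies");

            // Scroll to the bottom a few times so the AJAX pagination fires;
            // the iteration count and sleep are guesses, tune as needed.
            for (int i = 0; i < 5; i++) {
                ((JavascriptExecutor) driver)
                        .executeScript("window.scrollTo(0, document.body.scrollHeight);");
                Thread.sleep(2000); // crude wait for the next block to load
            }

            // Selenium queries the rendered DOM, so elements added by the
            // script are visible here, unlike with a plain Jsoup fetch.
            List<WebElement> companies = driver.findElements(By.cssSelector(".content .name"));
            for (WebElement company : companies) {
                System.out.println("Company: " + company.getText());
            }
        } finally {
            driver.quit();
        }
    }
}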

OLD ANSWER:

Jsoup limits the downloaded body to 1MB by default, unless told otherwise via the maxBodySize() method. So you may want to do this:

Document document = Jsoup.connect(url)
             .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
             .maxBodySize(0)
             .timeout(1000*5)
             .get();

Beware: the above turns off the size limit altogether. This may not be a good idea, since Jsoup builds the whole DOM in memory, so you may run into heap-size problems with big documents. If you do, it may help to switch to a SAX-based HTML parser.

The site initially loads only the first 20 elements. When you scroll down, the next block of elements is loaded by a script (a POST to http://www.top1000.ie/companies?page=2 ), which then adds the received elements to the DOM.

However, the response you get from a POST to /companies?page= is JSON:

{
 "worked":true,
 "has_more":true,
 "next_url":"/companies?page=3",
 "html":"..."
 ...
}

Here the "html" field seems to contain the elements that will be added to the DOM.

Fetching this data with Jsoup alone would be tedious, because Jsoup wraps the raw JSON in HTML tags and also escapes certain characters.

I think you would be better off using one of the ways described in this post: connect to http://www.top1000.ie/companies?page=1 and read the data page by page.

Edit: here's a minimal example of how you could approach this problem using HttpURLConnection and the minimal-json parser.

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;

import com.eclipsesource.json.Json;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

void readPage(int page) throws IOException {
    URL url = new URL("http://www.top1000.ie/companies?page=" + page);

    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);
    connection.setRequestMethod("POST");

    // Send an empty POST body; the page number is already in the URL.
    try (OutputStreamWriter writer = new OutputStreamWriter(connection.getOutputStream())) {
        writer.write("");
    }

    if (connection.getResponseCode() == HttpURLConnection.HTTP_OK) {
        try (Reader reader = new InputStreamReader(connection.getInputStream())) {
            // Pull the HTML fragment out of the "html" field of the JSON response.
            String html = Json
                .parse(reader)
                .asObject()
                .getString("html", "");

            // Parse the fragment with Jsoup and select the company names.
            Elements companies = Jsoup
                .parse(html)
                .body().select(".content .name");

            for (Element company : companies) {
                System.out.println("Company: " + company.text());
            }
        }
    } else {
        // handle HTTP error code.
    }
}

Here we use HttpURLConnection to send a POST request (without any body) to the URL, use the JSON parser to extract the "html" field from the result, and then parse that fragment with Jsoup. Just call the method in a loop for the pages you want to read, as sketched below.
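For example, a driver loop might look like this. The fixed page count of 50 is just an assumption (1000 companies at 20 per page); you could instead stop when the "has_more" field in the JSON response turns false.

// Hypothetical driver: reads pages 1..50 (1000 companies / 20 per page).
// Assumes readPage is declared static, or is called on an instance.
public static void main(String[] args) throws IOException {
    for (int page = 1; page <= 50; page++) {
        readPage(page);
    }
}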
