
Get results from all the pages using JSoup

I'm using the jsoup library to scrape DuckDuckGo. I need the titles of all results for a query, across every results page, but with

Document doc = Jsoup.connect("https://duckduckgo.com/html/?q=" + query).get();

I only get the results from the first page. How can I continue to the next pages?

Each results page contains a navigation form whose inputs hold the request parameters for the next page. Extract those parameters from the current page and send them with the next request. This is how:

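    import java.io.IOException;
    import java.util.Map;
    import java.util.stream.Collectors;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Collects the named inputs of the navigation form as request parameters.
    // Note: first() returns null when the page has no such form.
    public static Map<String, String> getFormParams(final Document doc) {
        return doc.select("div.nav-link > form")
                .first()
                .select("input")
                .stream()
                // attr() returns "" (never null) when the attribute is missing
                .filter(input -> !input.attr("name").isEmpty())
                .collect(Collectors.toMap(input -> input.attr("name"), input -> input.attr("value")));
    }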

    public static void main(final String... args) throws IOException {
        final String baseURL = "https://duckduckgo.com/html";
        final Connection conn = Jsoup.connect(baseURL)
                .userAgent("Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19");
        conn.data("q", "search phrase"); // Replace "search phrase" with your query

        // 1st page
        final Document page1 = conn.get();

        // Request parameters for the 2nd page, read from the form on the 1st page
        final Map<String, String> formParams = getFormParams(page1);

        // 2nd page
        final Document page2 = conn.data(formParams).get();
    }
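
To go beyond the second page, the same step can be repeated in a loop. The following is only a sketch under some assumptions: the class name `DuckDuckGoPager`, the `maxPages` limit, and the title selector `a.result__a` are hypothetical and not part of the answer above, and the code assumes the first form matching `div.nav-link > form` is always the "next page" form. The page markup may differ or change, so check the selectors against the actual HTML.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DuckDuckGoPager {
    // Assumption: result titles are links with this class; verify against the real page.
    static final String TITLE_SELECTOR = "a.result__a";
    // Same form selector as in the answer above.
    static final String NEXT_FORM_SELECTOR = "div.nav-link > form";

    // Returns the text of every element matching the title selector.
    static List<String> extractTitles(Document doc) {
        return doc.select(TITLE_SELECTOR).eachText();
    }

    // Returns the named inputs of the first navigation form,
    // or an empty map when the page has no such form.
    static Map<String, String> nextPageParams(Document doc) {
        Map<String, String> params = new LinkedHashMap<>();
        Element form = doc.selectFirst(NEXT_FORM_SELECTOR);
        if (form == null) {
            return params;
        }
        for (Element input : form.select("input[name]")) {
            String name = input.attr("name");
            if (!name.isEmpty()) {
                params.put(name, input.attr("value"));
            }
        }
        return params;
    }

    // Fetches up to maxPages pages and gathers the titles found on each.
    static List<String> collectTitles(String query, int maxPages) throws IOException {
        List<String> titles = new ArrayList<>();
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        for (int page = 0; page < maxPages && !params.isEmpty(); page++) {
            // A fresh connection per page keeps parameters from earlier
            // requests from being sent again.
            Document doc = Jsoup.connect("https://duckduckgo.com/html/")
                    .userAgent("Mozilla/5.0")
                    .data(params)
                    .get();
            titles.addAll(extractTitles(doc));
            params = nextPageParams(doc);
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        for (String title : collectTitles("search phrase", 3)) {
            System.out.println(title);
        }
    }
}
```

The loop stops when a page has no navigation form or when `maxPages` is reached. If later pages also contain a "previous page" form, the selector would need to be narrowed so that only the "next" form is picked.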
