Get results from all the pages using JSoup
I'm using the jsoup library and ran into a problem today. I need to scrape DuckDuckGo and collect the titles of all the results for a query, across every results page, but using
Document doc = Jsoup.connect("https://duckduckgo.com/html/?q=" + query).get();
I only get the results from the first page. How can I continue to the next pages?
You need to extract the form parameters from each page to build the request parameters for the next page. Here is how:
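import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.stream.Collectors;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DuckDuckGoPages {

    // Collects the named inputs of the navigation form into a name -> value map.
    public static Map<String, String> getFormParams(final Document doc) {
        final Element form = doc.select("div.nav-link > form").first();
        if (form == null) {
            // No navigation form on this page, so there is nothing to follow.
            return Collections.emptyMap();
        }
        return form.select("input")
                .stream()
                // attr() returns "" (never null) when the attribute is missing
                .filter(input -> !input.attr("name").isEmpty())
                .collect(Collectors.toMap(
                        input -> input.attr("name"),
                        input -> input.attr("value"),
                        (first, second) -> second)); // keep the last value on duplicate names
    }

    public static void main(final String... args) throws IOException {
        final String baseURL = "https://duckduckgo.com/html";
        final Connection conn = Jsoup.connect(baseURL)
                .userAgent("Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19");
        conn.data("q", "search phrase"); // Change "search phrase"

        // 1st page
        final Document page1 = conn.get();
        final Map<String, String> formParams = getFormParams(page1);

        // 2nd page: resend the request with the parameters taken from the form
        final Document page2 = conn.data(formParams).get();
    }
}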
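
To go past the second page, the same step can be repeated in a loop until a page yields no further form parameters. Below is a minimal sketch of that loop. It is deliberately independent of jsoup so it runs on its own: `PageWalker`, `Fetcher`, `fetchAll` and `maxPages` are hypothetical names introduced for illustration, not part of jsoup, and the demo uses fake integer "pages" instead of real requests.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PageWalker {

    // Hypothetical callback: performs one request with the given parameters.
    @FunctionalInterface
    public interface Fetcher<P> {
        P fetch(Map<String, String> params) throws IOException;
    }

    // Fetches pages one after another. After each page, nextParams derives the
    // parameters of the following request; an empty map means "no next page".
    // maxPages is a safety cap so the loop always terminates.
    public static <P> List<P> fetchAll(
            final Map<String, String> firstParams,
            final Fetcher<P> fetcher,
            final Function<P, Map<String, String>> nextParams,
            final int maxPages) throws IOException {
        final List<P> pages = new ArrayList<>();
        Map<String, String> params = firstParams;
        while (pages.size() < maxPages && !params.isEmpty()) {
            final P page = fetcher.fetch(params);
            pages.add(page);
            params = nextParams.apply(page);
        }
        return pages;
    }

    // Offline demo: each "page" is just its page number, and there are 3 pages.
    public static List<Integer> demo() {
        try {
            return PageWalker.<Integer>fetchAll(
                    Map.of("page", "1"),
                    params -> Integer.parseInt(params.get("page")),
                    page -> page < 3
                            ? Map.of("page", String.valueOf(page + 1))
                            : Map.<String, String>of(),
                    10);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String... args) {
        System.out.println(demo()); // prints [1, 2, 3]
    }
}
```

With jsoup, the fetcher could be something like `params -> Jsoup.connect(baseURL).userAgent(ua).data(params).get()` (where `ua` stands for the user-agent string), starting from a map that contains only `q`, and `nextParams` could be `getFormParams`. Opening a fresh connection per page avoids piling up parameters from earlier requests on a reused `Connection`. The titles would then be selected from each returned `Document`. The right CSS selector depends on DuckDuckGo's current markup, so check the page source first.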