Jsoup links extraction

Hello guys, I am trying to extract all the anchor links from AOL, but it is not working. The same code works with Yahoo and Bing. What could be the problem?

Document document5 = Jsoup.connect("www.aol.com").get();
Elements links5 = document5.select("a");

for (Element link5 : links5) {
    out.println(link5.attr("href"));
}

As per the comments on your previous question:

Even after I specify the protocol, only Google and AOL are not working; the same code works with Yahoo, Bing and Ask. My project is to implement a metasearch engine. I am able to extract links from Yahoo, Bing and Ask, but the same does not work with Google and AOL. What may be the reason?

They've blocked your request because you're acting as a robot/leecher, which may violate their terms of service. Their websites receive very heavy traffic, and they don't want to waste bandwidth on robots/leechers that only need a small part of the response.

Use their public web service APIs instead of parsing the HTML of the entire website. For Google, that's for example the "Google Custom Search API". Other search engine providers offer similar web services. Note that those web services don't return bloated HTML, but compact JSON or XML data, which is much easier to parse using JSON/XML parsers.
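
To make the API route a bit more concrete, here is a minimal sketch that only builds the request URL for Google's Custom Search JSON API. The endpoint and the `key`, `cx` and `q` parameter names are as I understand them from Google's documentation, so please check the current docs before relying on them. `CustomSearchUrl`, `build` and the placeholder credentials are hypothetical names made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of jsoup or any Google library.
final class CustomSearchUrl {

    // Assumed endpoint of the Custom Search JSON API; verify against the current docs.
    private static final String BASE = "https://www.googleapis.com/customsearch/v1";

    // Builds the request URI from an API key, a search engine ID (cx) and a query.
    static URI build(String apiKey, String engineId, String query) {
        return URI.create(BASE
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query));
    }

    // URL-encodes one query parameter value as UTF-8 (spaces become '+').
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "YOUR_API_KEY" and "YOUR_ENGINE_ID" are placeholders, not real credentials.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "jsoup links"));
    }
}
```

The resulting URI can be fetched with whatever HTTP client you already use. The response should be JSON rather than HTML, so a JSON parser can pull out the result links directly instead of selecting anchor tags.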

Your user agent might be missing. Add a user agent:

String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36";
Jsoup.connect(link).userAgent(USER_AGENT).get();

You need to specify the protocol:

Document document5 = Jsoup.connect("http://www.aol.com/").get();
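
If the URLs come from user input or a config file, a small guard can add the protocol before the string is handed to Jsoup.connect. To my knowledge, jsoup rejects a URL without a protocol with an IllegalArgumentException, though the exact message differs between versions. The sketch below uses only the standard library; `SchemeFixer` and `ensureScheme` are hypothetical names, and defaulting to "http://" is an assumption you may want to change to "https://".

```java
// Hypothetical helper for illustration; not part of jsoup.
final class SchemeFixer {

    // Returns the URL unchanged if it already starts with a scheme such as
    // "http://" or "https://"; otherwise prepends "http://" (an assumed default).
    static String ensureScheme(String url) {
        String trimmed = url.trim();
        // A scheme is a letter followed by letters, digits, '+', '.' or '-', then "://".
        if (trimmed.matches("(?i)^[a-z][a-z0-9+.-]*://.*")) {
            return trimmed;
        }
        return "http://" + trimmed;
    }

    public static void main(String[] args) {
        System.out.println(ensureScheme("www.aol.com"));          // prints http://www.aol.com
        System.out.println(ensureScheme("https://www.aol.com/")); // prints https://www.aol.com/
    }
}
```

The result can then be passed on, for example Jsoup.connect(SchemeFixer.ensureScheme("www.aol.com")).get(). Note that this only fixes the malformed URL; it does nothing about a site blocking the request.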
