
Jsoup links extraction

Hello, I am trying to extract all the anchor links from AOL, but it is not working. The same code works with Yahoo and Bing. What could be the problem?

Document document5 = Jsoup.connect("www.aol.com").get();
Elements links5 = document5.select("a");

for (Element link5 : links5) {
    out.println(link5.attr("href"));
}

As per the comments on your previous question:

even after im specifying the protocol...only google and aol are not working, same is working with yahoo, bing and ask.... my project is to implement a metasearch engine....i am able to extract links from yahoo, bing and ask...but same does not work with google and aol...what may be the reason..??

They've blocked your request because you're acting as a robot/leecher, which may violate their terms of service. Their websites receive a very high volume of requests, and they don't want to waste bandwidth on robots/leechers that only need a small part of the response.

Use their public web service APIs instead of parsing the HTML of the entire website. For Google, that's for example the "Google Custom Search API". Other search engine providers offer similar web services. Note that those web services don't return bloated HTML, but compact JSON or XML data, which is much easier to parse and extract from using JSON/XML parsers.
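As a rough illustration of the API route, here is a minimal sketch that only builds the request URL for the Google Custom Search JSON API, using nothing but the JDK. The class name CustomSearchUrl and the method build are made up for this example, and the API key and search engine ID (the cx parameter) are placeholders you would get from your own Google account. It does not send the request or parse the JSON response.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Jsoup or any Google library.
class CustomSearchUrl {

    // Endpoint of the Google Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds the GET URL from an API key, a search engine ID (cx) and a query.
    // All three values are URL-encoded so spaces and symbols are safe.
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    // Form-style URL encoding in UTF-8 (a space becomes '+').
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

You would then fetch that URL with any HTTP client and read the result links from the JSON body with a JSON parser, instead of selecting anchor elements out of a full HTML page.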

Your request might be missing a user agent. Try setting one explicitly:

// A browser-like user agent string; link is the absolute URL to fetch.
String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36";
Document document = Jsoup.connect(link).userAgent(USER_AGENT).get();

You need to specify the protocol:

Document document5 = Jsoup.connect("http://www.aol.com/").get();
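If the URLs come from user input or a configuration file, one way to guard against a missing protocol is to normalize them before calling Jsoup.connect. The following is a small sketch using only the JDK; UrlNormalizer and ensureProtocol are hypothetical names invented for this example, and defaulting to http:// rather than https:// is an assumption you may want to change.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Jsoup.
class UrlNormalizer {

    // Returns the input unchanged if it already starts with http:// or https://,
    // otherwise prepends "http://". Throws IllegalArgumentException if the
    // result is not a syntactically valid URI with a host.
    static String ensureProtocol(String raw) {
        String trimmed = raw.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        String candidate = (lower.startsWith("http://") || lower.startsWith("https://"))
                ? trimmed
                : "http://" + trimmed;
        try {
            URI uri = new URI(candidate);
            if (uri.getHost() == null) {
                throw new IllegalArgumentException("No host in URL: " + raw);
            }
            return uri.toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + raw, e);
        }
    }
}
```

With this in place, the call from the question could be written as Jsoup.connect(UrlNormalizer.ensureProtocol("www.aol.com")).get(). Note that this only fixes the missing protocol; it does not help if the site blocks automated requests.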


 