
JSoup.connect throws 403 error while apache.httpclient is able to fetch the content

I am trying to parse the HTML dump of any given page. I used HTML Parser and also tried JSoup for parsing.

I found useful functions in Jsoup, but I am getting a 403 error when calling Document doc = Jsoup.connect(url).get();

I tried HttpClient to get the HTML dump, and it succeeded for the same URL.

Why is JSoup returning 403 for the same URL that Apache Commons HttpClient fetches successfully? Am I doing something wrong? Any thoughts?
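The likely difference is the default User-Agent header: Jsoup sits on top of HttpURLConnection, which identifies itself as Java/1.x unless you override the User-Agent, and some servers answer that with 403, while Apache HttpClient sends its own default (e.g. Apache-HttpClient/4.x). For reference, a minimal sketch of the kind of HttpClient 4.x call that succeeds here; the question doesn't show the exact code used (and mentions Commons HttpClient, whose 3.x API differs slightly), so this is illustrative:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

CloseableHttpClient client = HttpClients.createDefault();
// HttpClient sends its own default User-Agent header, so the server does not see a bare Java client
String html = EntityUtils.toString(client.execute(new HttpGet(url)).getEntity());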

The working solution is as follows (thanks to Angelo Neuschitzer for the reminder to post it as a solution):

import javax.swing.text.html.HTML;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements links = doc.getElementsByTag(HTML.Tag.CITE.toString()); // toString() needs parentheses; renders as "cite"
for (Element link : links) {
    String linkText = link.text();
    System.out.println(linkText);
}

So, userAgent does the trick :)
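If a plain "Mozilla" token ever stops being enough, Jsoup's Connection API also lets you send a full browser User-Agent string, a referrer, and a timeout. A minimal sketch; the header values below are illustrative, not required:

Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // illustrative full browser UA string
        .referrer("http://www.google.com")                      // some sites also check the Referer header
        .timeout(10000)                                         // milliseconds
        .get();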
