
Not able to parse the complete HTML of a URL using Jsoup

The Jsoup library is not parsing the complete HTML of a given URL. Some divisions are missing from the original HTML of the page.

Interesting thing: http://facebook.com/search.php?init=s:email&q=somebody@gmail.com&type=users

If you enter the URL above on Jsoup's official site, http://try.jsoup.org/, it fetches the page and correctly shows the exact HTML of the URL, but I cannot get the same result in my own program using the jsoup library.

Here is my Java code:

String url="http://facebook.com/search.php?init=s:email&q=somebody@gmail.com&type=users";

Document document = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36").get();

String question = document.toString();
System.out.println(" whole content: "+question);

I explicitly set the same userAgent that their official site uses, but in the result I can see only about 70% of the original HTML. Somewhere in the middle, a few div tags are missing, and they contain the data I want.

I tried again and again, with no success. Why are a few div tags missing from the document?

You can paste the URL directly into your browser. If you are logged into Facebook, you will see the response: "No results found for your query. Check your spelling or try another term." This is what I am looking for when jsoup parses the HTML of the URL above.

Unfortunately, this part is missing. The response is actually in the div with id "pagelet_search_no_results" (selector "#pagelet_search_no_results"). I could not find the div with this id in the parsed HTML. I tried many of the methods jsoup provides, but had no luck.

As far as I know, Jsoup usually limits the size of the retrieved content to 1 MB. Try this to get the full HTML source:

Document document = Jsoup.connect(url)
  .userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36")
  .maxBodySize(0)
  .get();

The maxBodySize(0) call removes the 1 MB limit. There are other useful parameters you can set on the connection, such as a timeout or cookies.
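To see why a size cap can make a div disappear, here is a minimal stdlib-only sketch. It does not use jsoup and is not jsoup's actual implementation; the class BodyLimitDemo and the helpers readBody and buildPage are hypothetical names made up for illustration. It only mimics the convention described above: a positive limit truncates the body, and 0 means no limit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates a capped body read, not jsoup's real code.
class BodyLimitDemo {

    // Reads at most maxBytes from the stream; maxBytes == 0 means "no limit".
    static String readBody(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        try {
            while (true) {
                int want = (maxBytes == 0) ? buffer.length : Math.min(buffer.length, remaining);
                if (want == 0) {
                    break; // cap reached: the rest of the body is silently dropped
                }
                int n = in.read(buffer, 0, want);
                if (n == -1) {
                    break; // end of stream
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Builds a fake page: filler markup first, then the div of interest near the end.
    static String buildPage(int paddingBytes) {
        StringBuilder sb = new StringBuilder("<html><body>");
        while (sb.length() < paddingBytes) {
            sb.append("<p>filler</p>");
        }
        sb.append("<div id=\"pagelet_search_no_results\">No results found</div>");
        sb.append("</body></html>");
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = buildPage(2 * 1024 * 1024).getBytes(StandardCharsets.UTF_8);

        String capped = readBody(new ByteArrayInputStream(page), 1024 * 1024); // 1 MB cap
        String full = readBody(new ByteArrayInputStream(page), 0);             // no cap

        System.out.println("capped contains div: " + capped.contains("pagelet_search_no_results"));
        System.out.println("full contains div:   " + full.contains("pagelet_search_no_results"));
    }
}
```

With the cap in place, everything after the first megabyte is cut off, so the div is absent from the string; with the limit set to 0 it is present. If the element is still missing from a real response after the limit is removed, the cause is probably something other than body size, for example one of the other connection settings mentioned above, such as cookies.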

You should also set a large timeout, e.g.:

Document document = Jsoup.connect(url)
  .header("Accept-Encoding", "gzip, deflate")
  .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
  .maxBodySize(0)
  .timeout(600000)
  .get();
