Fetching the website with Jsoup - page view source and Jsoup shows different content

I use Jsoup to scrape the website:

doc = Jsoup.connect(String.valueOf(urls[0])).userAgent("Mozilla").get();    

Here is the link:

http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40

I have added the rpp=40 parameter to the link in the command line to display 40 results per page. I'm able to see all the results in page view source. I know that Jsoup is for static content only and cannot fetch websites that use AJAX/JS library techniques to generate content. However, why can't Jsoup retrieve the same content that I can see in the browser via page view source? Page view source shows 40 results, whereas Jsoup is able to retrieve elements from only 10 results. How can I obtain every element visible via page view source?

Short answer: Jsoup can't execute JavaScript.

Long answer:

http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40

The web page you are looking for accepts an HTTP GET with parameters. A normal browser sends the parameters and loads the page, but not with Willowbrook checked (in your example). The JavaScript is loaded after the page itself, and it is the JavaScript that ticks the filter check box and filters the search results. Therefore, when you use Jsoup you are getting more results, because it loads 'state=IL' without the 'willowbrook' filter applied.
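One detail that supports this: in the URL from the question, everything after the `#` is a URL fragment, and an HTTP client never sends the fragment to the server. It is only available to JavaScript running in the page. The sketch below uses only `java.net.URI` to split the URL and show which parameters reach the server. The class and method names (`FragmentDemo`, `requestUrl`, `serverQuery`, `fragment`) are made up for illustration and are not part of Jsoup.

```java
import java.net.URI;

public class FragmentDemo {

    // The part of the URL that is actually transmitted in the HTTP request:
    // everything before the first '#'.
    static String requestUrl(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Query string as the server receives it (still percent-encoded).
    static String serverQuery(String url) {
        return URI.create(url).getRawQuery();
    }

    // Fragment: kept by the client, readable only by scripts in the page.
    static String fragment(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        String url = "http://www.yelp.com/search?find_desc=restaurant"
                + "&find_loc=willowbrook%2C+IL&ns=1"
                + "#l=p:IL:Willowbrook::&sortby=rating&rpp=40";

        System.out.println("Sent to server : " + requestUrl(url));
        System.out.println("Server query   : " + serverQuery(url));
        System.out.println("Client-only    : " + fragment(url));
    }
}
```

In this URL, `sortby=rating` and `rpp=40` sit in the fragment, so a client that does not run JavaScript, such as Jsoup, requests the page without them. Whether the site would honor those parameters if they were moved into the query string is not something the source confirms.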
