Web crawler using Jsoup: cannot fetch custom tags

I am trying to build a web crawler using Jsoup. It works for other pages, but it cannot crawl Swiggy data. I think this is due to the presence of a custom tag, restaurant-menu. I do this:

Document document = Jsoup.connect(url).get();
Elements document_body = document.select(".layout-wrapper");
System.out.println(document_body.html());

In the output, I get this:

<div class="restaurant-menu-container"> <restaurant-menu></restaurant-menu> </div>

The restaurant-menu tag is empty, but if you visit the website and inspect the page, all of the data is present inside the restaurant-menu tag.

Is it due to the custom tag, or is there some other reason?

Reading the content of the restaurant-menu element is simple:

document.select("div.restaurant-menu-container restaurant-menu")

But when you fetch the page with Jsoup (just as when you browse to the page and use View Source), you'll find that the element has no content. This is because Jsoup parses the static HTML returned by the server, and the content of restaurant-menu is created dynamically after the page loads. The custom tag itself is not the problem.
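To illustrate the point, here is a minimal sketch that parses the exact markup from the question's output instead of fetching it over the network. The class name StaticParseDemo and the helper findMenu are hypothetical names chosen for this example; the only assumption is that Jsoup (1.11 or later, for selectFirst) is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StaticParseDemo {
    // The markup the server returned, copied from the output in the question.
    static final String SERVER_HTML =
        "<div class=\"restaurant-menu-container\"> <restaurant-menu></restaurant-menu> </div>";

    /** Returns the custom element, or null when the selector matches nothing. */
    static Element findMenu(String html) {
        Document doc = Jsoup.parse(html);
        // Descendant selector: a <restaurant-menu> tag inside the container div.
        return doc.selectFirst("div.restaurant-menu-container restaurant-menu");
    }

    public static void main(String[] args) {
        Element menu = findMenu(SERVER_HTML);
        // Jsoup handles the custom tag like any other element...
        System.out.println("custom tag found: " + (menu != null));
        // ...but the static HTML simply contains nothing inside it.
        System.out.println("child elements:   " + menu.children().size());
        System.out.println("inner HTML empty: " + menu.html().isEmpty());
    }
}
```

Run against this markup, the sketch should report that the tag is found and that it has no children, which suggests the missing data is a matter of when the content is generated rather than of how Jsoup treats custom tags.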

Jsoup does not execute JavaScript, so it cannot see dynamically generated content. If you want to extract dynamic content programmatically, you'll likely need to look at something like Selenium.
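One possible way to combine the two tools is sketched below: Selenium renders the page in a real browser, and Jsoup then parses the rendered source. This is an untested outline under several assumptions: Selenium 4 and a recent Chrome with a matching driver are installed, the menu counts as loaded once the custom tag has at least one child element, and the component renders into the regular DOM (content inside a shadow root would not appear in the page source). The class name RenderedMenuFetcher and its method names are hypothetical.

```java
import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedMenuFetcher {
    static final String MENU_SELECTOR = "div.restaurant-menu-container restaurant-menu";

    /** Parses already-rendered HTML with Jsoup; returns the menu's inner HTML, or "" if absent. */
    static String extractMenuHtml(String renderedHtml) {
        Document doc = Jsoup.parse(renderedHtml);
        Element menu = doc.selectFirst(MENU_SELECTOR);
        return menu == null ? "" : menu.html();
    }

    /** Loads the page in headless Chrome and returns the source after the menu is populated. */
    static String fetchRenderedHtml(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // assumption: a Chrome version that supports this flag
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            // Assumption: the menu is ready once the custom tag has at least one child element.
            new WebDriverWait(driver, Duration.ofSeconds(15))
                .until(ExpectedConditions.presenceOfElementLocated(
                    By.cssSelector(MENU_SELECTOR + " > *")));
            return driver.getPageSource();
        } finally {
            driver.quit(); // always release the browser process
        }
    }

    public static void main(String[] args) {
        String url = args[0]; // the page URL from the question, passed on the command line
        System.out.println(extractMenuHtml(fetchRenderedHtml(url)));
    }
}
```

Keeping the Jsoup step in its own method means the existing selector logic can be reused unchanged; only the way the HTML is obtained differs from the original Jsoup.connect(url).get() call.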

Here is a tutorial explaining how to crawl with StormCrawler + Selenium. StormCrawler (SC) uses Jsoup under the bonnet to parse HTML documents; you can write extraction rules based on XPath and also interact with the page dynamically via NavigationFilters.
