Web crawler using JSOUP - cannot fetch custom tags
I am trying to build a web crawler using JSOUP. It works for other pages, but it is not able to crawl Swiggy data. I think this is due to the presence of a custom tag, restaurant-menu. I do this:
Document document = Jsoup.connect(url).get();
Elements document_body = document.select(".layout-wrapper");
System.out.println(document_body.html());
And in the output, I get this:
<div class="restaurant-menu-container"> <restaurant-menu></restaurant-menu> </div>
The restaurant-menu tag is empty, whereas if you visit the website and inspect its content, all of the data is present inside the restaurant-menu tag.
Is it due to the custom tags or is there some other reason?
Reading the content of the restaurant-menu element is simple:
document.select("div.restaurant-menu-container > restaurant-menu")
But when you use JSoup (just as when you browse to the page and View Source) you'll find that there is no content. This is because JSoup parses static HTML content, and the content of the restaurant-menu element is created dynamically in the browser after the page loads. The custom tag itself is not the problem: JSoup selects it like any other element.
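To check that the custom tag is not what trips the parser, here is a minimal sketch that feeds the output shown in the question back into JSoup. It assumes JSoup is on the classpath; the class and method names are hypothetical and only for illustration. The selector uses a child combinator because restaurant-menu is an element name, not a class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CustomTagCheck {

    // Hypothetical helper: true when <restaurant-menu> is found in the
    // parsed HTML but contains no child elements.
    static boolean isPresentButEmpty(String html) {
        Document doc = Jsoup.parse(html);
        Element menu = doc.selectFirst("div.restaurant-menu-container > restaurant-menu");
        return menu != null && menu.children().isEmpty();
    }

    public static void main(String[] args) {
        // The static markup from the question's output.
        String html = "<div class=\"restaurant-menu-container\"> "
                + "<restaurant-menu></restaurant-menu> </div>";
        System.out.println(isPresentButEmpty(html));
    }
}
```

If the element is found but has no children, the parser handled the custom tag and the data simply was not in the HTML that the server returned.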
JSoup cannot parse dynamic content. If you want to extract dynamic content programmatically, you'll likely need to look at something like Selenium.
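One possible shape for the Selenium route is sketched below: let a real browser render the page, then hand the rendered markup to JSoup. This is only a sketch under several assumptions: Selenium 4 and JSoup are on the classpath, a local Chrome installation is available, and the menu counts as rendered once restaurant-menu has at least one child. The class name, helper name and the 15 second timeout are hypothetical.

```java
import java.time.Duration;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedMenuFetcher {

    // Hypothetical helper: returns the inner HTML of <restaurant-menu>
    // from already-rendered markup, or an empty string if it is absent.
    static String extractMenuHtml(String renderedHtml) {
        Document doc = Jsoup.parse(renderedHtml);
        Element menu = doc.selectFirst("div.restaurant-menu-container > restaurant-menu");
        return menu == null ? "" : menu.html();
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: RenderedMenuFetcher <url>");
            return;
        }
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run Chrome without a window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(args[0]);
            // Assumption: the menu is ready once the custom element has a child.
            new WebDriverWait(driver, Duration.ofSeconds(15))
                    .until(ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("restaurant-menu > *")));
            // The page source now reflects the DOM after scripts have run.
            System.out.println(extractMenuHtml(driver.getPageSource()));
        } finally {
            driver.quit(); // always release the browser process
        }
    }
}
```

Keeping the extraction in a separate method means the existing JSoup selectors can be reused unchanged; only the way the HTML is obtained differs.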
Here is a tutorial explaining how to crawl with StormCrawler + Selenium. StormCrawler (SC) uses Jsoup under the bonnet for parsing the HTML documents. You can write extraction rules based on XPath, and you can also interact with the page dynamically via NavigationFilters.