
Web crawler using JSOUP - cannot fetch custom tags

I am trying to build a web crawler using JSOUP. It works for other pages, but it is not able to crawl Swiggy data. I think this is due to the presence of a custom tag, restaurant-menu. This is what I do:

Document document = Jsoup.connect(url).get();
Elements document_body = document.select(".layout-wrapper");
System.out.println(document_body.html());

And in the output, I get this:

<div class="restaurant-menu-container"> <restaurant-menu></restaurant-menu> </div>

The restaurant-menu tag is empty, whereas if you visit the website and inspect the page, all of the data is present inside the restaurant-menu tag.

Is it due to the custom tags or is there some other reason?

Selecting the restaurant-menu element is simple. Note the space in the selector: restaurant-menu is a tag nested inside the div, not a second class on it:

document.select("div.restaurant-menu-container restaurant-menu")

But when you use JSoup (just as when you browse to the page and View Source), you'll find that the element has no content. This is because JSoup parses the static HTML returned by the server, whereas the content of restaurant-menu is created dynamically by scripts after the page loads in a browser. The custom tag itself is not the problem.
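
One way to confirm that the custom tag is not the culprit is to feed JSoup the markup directly, with no network involved. The sketch below is illustrative only: the class name CustomTagDemo, the helper menuText, and the "rendered" markup are made up for the example, and the selector uses a descendant combinator (a space) to reach the tag inside the container div.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CustomTagDemo {

    // Returns the text inside <restaurant-menu>, or "" if the tag is empty or absent.
    static String menuText(String html) {
        Document document = Jsoup.parse(html);
        // Descendant selector: a <restaurant-menu> element inside the container div.
        Elements menu = document.select("div.restaurant-menu-container restaurant-menu");
        return menu.text();
    }

    public static void main(String[] args) {
        // What the server sends (as in the question): the custom tag is present but empty.
        String served = "<div class=\"restaurant-menu-container\">"
                + "<restaurant-menu></restaurant-menu></div>";
        // Hypothetical markup after the page's scripts have run in a browser.
        String rendered = "<div class=\"restaurant-menu-container\">"
                + "<restaurant-menu><h2>Starters</h2></restaurant-menu></div>";

        System.out.println("served:   '" + menuText(served) + "'");
        System.out.println("rendered: '" + menuText(rendered) + "'");
    }
}
```

If the tag has children in the HTML that JSoup receives, they are selected like any other element. An empty result therefore points at the HTML being fetched, not at the tag name.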

JSoup does not run the scripts that generate dynamic content. If you want to extract dynamic content programmatically, you'll likely need to look at something like Selenium.
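
A rough sketch of that approach, assuming Selenium 4 with a local Chrome installation: let the browser render the page, wait until the custom tag has at least one child, then hand the rendered HTML to JSoup. The class name, the helper names, the 15-second timeout, and the "has a child element" wait condition are all assumptions for illustration. The real page may need a different readiness check.

```java
import java.time.Duration;

import org.jsoup.Jsoup;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedMenuCrawler {

    // Assumed readiness signal: any child element inside the custom tag.
    static final String RENDERED_MENU = "div.restaurant-menu-container restaurant-menu > *";

    // Pure JSoup step: pull the menu text out of already-rendered HTML.
    static String menuText(String renderedHtml) {
        return Jsoup.parse(renderedHtml)
                .select("div.restaurant-menu-container restaurant-menu")
                .text();
    }

    // Loads the page in a real browser, waits for the scripts to fill the menu,
    // then returns the DOM as the browser currently sees it.
    static String fetchRenderedHtml(String url) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            new WebDriverWait(driver, Duration.ofSeconds(15))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(RENDERED_MENU)));
            return driver.getPageSource();
        } finally {
            driver.quit(); // always release the browser, even if the wait times out
        }
    }

    public static void main(String[] args) {
        String url = args[0]; // the restaurant page URL, supplied by the caller
        System.out.println(menuText(fetchRenderedHtml(url)));
    }
}
```

Keeping the JSoup extraction separate from the browser step means the existing selectors can be reused unchanged. Only the way the HTML is obtained differs.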

There is a tutorial explaining how to crawl with StormCrawler + Selenium. StormCrawler uses Jsoup under the bonnet to parse HTML documents; you can write extraction rules based on XPath and also interact with the page dynamically via NavigationFilters.
