
Get an HTML document programmatically, simulating a web browser

I'm trying to fetch an HTML document with Jsoup, and I've noticed that the document returned by Jsoup.connect is not the same as the one I get when I load the page directly in a web browser.

Example:
I want to monitor the price of an article. I fetch the HTML document from "Icecat" using:

Jsoup.connect("http://icecat.es/es/p/sony/mdr-as200-blk/auriculares-0027242861022-Sony-MDR-AS200-18145805.html?ti=offers")
     .userAgent(userAgentString)
     .timeout(5000)
     .followRedirects(true)
     .execute();

(userAgentString: I tried several different values.)

But the document I get doesn't contain the pricing information; the tab that holds it appears "inactive".
By contrast, if I load the page in any web browser, it shows the price table right away.

Bonus question

I see the same behaviour when I try to fetch Google's results page. Typing https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#tbm=shop&q=Sony+MDR-AS200 directly into the web browser works, but when I fetch it from Java I'm redirected to Google's home page. I know Google's TOS, and I don't intend to do any massive parsing.

Jsoup does not execute JavaScript. If the site you are trying to fetch makes AJAX calls and builds (part of) the DOM dynamically, you are out of luck with Jsoup.

You can use Selenium WebDriver for that, or try to find the AJAX calls and trigger them directly.
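
To illustrate the second option, here is a minimal sketch that uses only the JDK's java.net.http client (Java 11+). It assumes you have already found the real AJAX endpoint in the Network tab of your browser's developer tools; the class name, the method names and any endpoint you pass in are hypothetical and not taken from Icecat. The X-Requested-With header is a convention that some servers check, not a requirement.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper: names are illustrative only.
final class AjaxRequestSketch {

    // Builds a GET request that resembles the XHR call a browser would make.
    static HttpRequest buildAjaxRequest(String endpointUrl, String userAgent) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(5))               // same 5 s budget as the question
                .header("User-Agent", userAgent)
                .header("Accept", "application/json, text/html;q=0.9")
                .header("X-Requested-With", "XMLHttpRequest") // assumption: server may check this
                .GET()
                .build();
    }

    // Sends the request, following redirects, and returns the raw response body.
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: AjaxRequestSketch <endpoint-url> <user-agent>");
            return; // no network access unless an endpoint is supplied
        }
        System.out.println(fetch(buildAjaxRequest(args[0], args[1])));
    }
}
```

If the endpoint returns an HTML fragment, the resulting string can be handed to Jsoup.parse for the usual selector work; if it returns JSON, a JSON parser is the better fit.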

