How to scrape HTTPS JavaScript web pages

I am trying to monitor day-to-day prices from an online catalogue. The site uses HTTPS and generates the catalogue pages with JavaScript. How can I interface with the site and make it generate the pages I need?

I have done this with other sites where the HTML can easily be accessed, and I have no problem parsing the HTML once it is generated.

I only know Python and Java.

Thanks in advance.

Take a look at HtmlUnit, a headless Java browser that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html

(obligatory warning: by screen-scraping the site, you may be breaking its ToS and possibly opening yourself up to lawsuits; check whether you are allowed to do it before you start)

If they've created a Web API that their JavaScript interfaces with, you might be able to scrape that directly, rather than trying to go the HTML route.
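As a rough sketch of that approach in Python: find the request the page's JavaScript makes (the browser's network inspector usually shows it), then call it yourself and read the prices out of the response. Everything specific here is an assumption for illustration: the endpoint URL in the comment, the JSON shape with an `items` list, and the `name` and `price` fields are hypothetical, and the real site will differ.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download a URL (HTTPS works the same as HTTP) and decode the body as JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_prices(payload):
    """Map item name -> price.

    Assumes a payload shaped like {"items": [{"name": ..., "price": ...}]};
    this shape is hypothetical, so adjust it to what the site really returns.
    """
    return {item["name"]: float(item["price"]) for item in payload.get("items", [])}


# Stand-in for what a real call might return, e.g.
#   payload = fetch_json("https://example.com/api/catalogue")  # hypothetical URL
sample = json.loads(
    '{"items": [{"name": "widget", "price": "9.99"}, {"name": "gadget", "price": 24.5}]}'
)
prices = extract_prices(sample)
print(prices)
```

If the site requires a login or session cookie, the same request will probably need those headers as well, which is something to check in the network inspector.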

If they've obfuscated it or that option isn't available for some other reason, you'll basically need a Web browser to evaluate the JavaScript and then scrape the browser's DOM. Perhaps write a browser plugin?
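An alternative to a plugin is to drive a real browser from Python and read the DOM once the scripts have run. The sketch below uses Selenium with headless Chrome, which is my substitution and not something named in this thread; it assumes Selenium 4 and a recent Chrome are installed. The `price` class name and the sample markup are also hypothetical.

```python
from html.parser import HTMLParser


def render_page(url, ready_selector, timeout=15):
    """Load url in headless Chrome, wait until the page's scripts have added
    an element matching ready_selector, and return the resulting DOM as HTML."""
    # Imported here so the parsing code below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumes a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, ready_selector)
        )
        return driver.page_source
    finally:
        driver.quit()


class PriceCollector(HTMLParser):
    """Collect the text of every element whose class list includes "price".

    Assumes price elements contain no void tags such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.prices = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "price" in classes:
            self._depth += 1
            if self._depth == 1:
                self.prices.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.prices[-1] += data.strip()


# Stand-in for the HTML that render_page(url, ".price") would return.
rendered = (
    '<ul><li><span class="name">widget</span> <span class="price sale">9.99</span></li>'
    '<li><span class="price">24.50</span></li></ul>'
)
collector = PriceCollector()
collector.feed(rendered)
print(collector.prices)
```

Since you already parse HTML comfortably, the only new part is `render_page`; its output can go straight into whatever parser you use now.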

I use WebKit through its Python bindings to scrape JavaScript-generated content. See here for an example.
