简体   繁体   中英

Can't scrape non-html elements

I'm trying to scrape search results from a number of websites. The problem is that not all of these sites return their search results as plain html text, a lot of it is dynamically generated with with JS, AJAX, etc. However, I can see exactly what I need by looking at the page with the Firefox inspector, since the scripts have all run and modified the html.

My question is: is there a way for me to download a webpage AFTER allowing the scripts to run, or at least get them to run locally. That way, I'd get the final html.

For reference, I'm using python.

Possible duplicate . In that case the question is with php and JS.

Sure, you have to provide some enviroment for scripts (js) to run and often to return a test value to target server. It's not that easy for the server side languages. So today for this we mostly leverage browser driving or imitating tools mentioned there.

I've found for you the python analog to v8js php plugin : PyV8 .

PyV8 is a python wrapper for Google V8 engine, it act as a bridge between the Python and JavaScript objects, and support to hosting Google's v8 engine in a python script.

If properly configured, your scraper:

  1. Gets site's js
  2. Evaluates this js thru the given plugin
  3. Gets access to target html for further parse.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM