Can't scrape non-html elements

Original post 2015-03-13 07:47:30 2 1 python/ html/ web-scraping

I'm trying to scrape search results from a number of websites. The problem is that not all of these sites return their search results as plain HTML; a lot of the content is generated dynamically with JS, AJAX, etc. However, I can see exactly what I need by looking at the page with the Firefox inspector, since the scripts have all run and modified the HTML.

My question is: is there a way for me to download a webpage AFTER allowing the scripts to run, or at least to get them to run locally? That way, I'd get the final HTML.

For reference, I'm using Python.

1 answer

Possible duplicate. In that case the question is about PHP and JS.

Sure. You have to provide some environment for the scripts (JS) to run in, and often return a test value to the target server. That's not easy in server-side languages, so today we mostly rely on the browser-driving or browser-imitating tools mentioned there.
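As a rough illustration of the browser-driving approach, here is a minimal sketch. It assumes Selenium as the driving tool (my choice for the example, not something named in this thread) and a locally installed Firefox. The helper names `fetch_rendered_html` and `render_with_firefox` and the `settle_seconds` delay are hypothetical.

```python
import time


def fetch_rendered_html(driver, url, settle_seconds=2.0):
    """Load `url` in a browser driver and return the DOM after scripts ran.

    `driver` is expected to behave like a Selenium WebDriver, i.e. to offer
    `get(url)` and a `page_source` attribute. The fixed delay is a crude
    assumption; a real scraper would wait for a specific element instead.
    """
    driver.get(url)             # the browser downloads the page and runs its JS
    time.sleep(settle_seconds)  # give AJAX calls a moment to finish
    return driver.page_source   # serialized DOM as the browser currently sees it


def render_with_firefox(url):
    """Hypothetical convenience wrapper; needs `pip install selenium` and Firefox."""
    from selenium import webdriver  # imported lazily so the module loads without it

    driver = webdriver.Firefox()
    try:
        return fetch_rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can then be handed to whatever HTML parser the scraper already uses, since it is ordinary HTML at that point.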

I've found the Python analog of the v8js PHP plugin for you: PyV8.

PyV8 is a Python wrapper for the Google V8 engine. It acts as a bridge between Python and JavaScript objects, and supports hosting Google's V8 engine in a Python script.

If properly configured, your scraper:

Gets the site's JS
Evaluates this JS through the given plugin
Gets access to the target HTML for further parsing.
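The first of these steps, getting the site's JS, can be sketched with the standard library alone. This is only an illustration under assumptions: the names `ScriptCollector` and `collect_scripts` are hypothetical, and the evaluation step is deliberately left out because it needs a JS engine such as PyV8 plus a stand-in for the browser's DOM.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects inline <script> bodies and external script URLs from static HTML."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []   # source text of inline scripts
        self.external_urls = []    # values of <script src="..."> attributes
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external_urls.append(src)  # must be downloaded separately
        self._in_script = True
        self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._buffer).strip()
            if body:
                self.inline_scripts.append(body)
            self._in_script = False


def collect_scripts(html):
    """Return (inline_scripts, external_urls) found in the given HTML string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.inline_scripts, parser.external_urls
```

The collected sources would then be passed to the JS engine. Scripts that touch `document` or `window` will fail in a bare engine unless those objects are emulated, which is a large part of why browser-driving tools are usually the easier route.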

Find all HTML and non-HTML encoded URLs in string

Distinguishing between HTML and non-HTML pages in Scrapy

generating non-html output from google's app engine

Plain (non-HTML) error pages in REST api

Can't scrape all elements with beautifulsoup

Grabbing non-HTML data from a website using python

Is it possible to include a hyperlink in a non-html (outlook) email on python?

Can't scrape HTML table using BeautifulSoup

Can't scrape nested html using BeautifulSoup

Can't scrape all HTML from Airbnb

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 0 2015-03-13 09:04:41
