
Scraping a web page with JavaScript in Python

I'm working in Python 3.2 (newbie) on a Windows machine (I also have Ubuntu 10.04 in VirtualBox if needed, but I prefer to work on Windows).

Basically, I'm able to use the http and urllib modules to scrape web pages, but only pages that don't use JavaScript such as document.write("<div....") and the like to add data that isn't in the page as I download it (meaning without real AJAX scripts).

To process those kinds of sites as well, I'm pretty sure I need a browser JavaScript processor to run the page's scripts and give me the final result as output, hopefully as a dict or text.

I tried to compile python-spidermonkey, but as I understand it, it isn't for Windows and it doesn't work with Python 3.x :-?

Any suggestions? If anyone has done something like this before, I'd appreciate the help!

I recommend Python's bindings to the WebKit library - here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.

Use Firebug to see exactly what is being called to get the data to display (a POST or GET URL?). I suspect there's an AJAX call that retrieves the data from the server as either XML or JSON. Just make the same AJAX call yourself and parse the data.
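As a rough sketch of "make the same AJAX call yourself", assuming the endpoint answers a plain GET with UTF-8 JSON: the function names `fetch_ajax` and `parse_items` are made up for illustration, and the `X-Requested-With` header is only a guess at what a given server might check for.

```python
import json
import urllib.request


def parse_items(payload):
    """Decode a JSON response body (bytes, assumed UTF-8) into Python objects."""
    return json.loads(payload.decode("utf-8"))


def fetch_ajax(url, timeout=10):
    """Request the URL that the page's script calls and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        # Assumption: some servers only answer requests that look like AJAX calls.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_items(response.read())
```

You would call `fetch_ajax` with whatever URL Firebug shows for the request; a POST endpoint would additionally need a `data=` argument on the `Request`.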

Alternatively, you can download Selenium for Firefox, start a Selenium server, load the page via Selenium, and get the DOM contents. MozRepl works as well, but it has less documentation since it's not widely used.
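A minimal sketch of the Selenium route might look like the following. Note that it uses the WebDriver bindings from the `selenium` Python package rather than a separately started Selenium server, and it assumes Firefox and its driver are installed; `rendered_html` and `scrape_with_firefox` are illustrative names, not part of Selenium.

```python
def rendered_html(driver, url):
    """Load url in the given browser session and return the DOM as serialized HTML."""
    driver.get(url)
    # page_source reflects the document after the browser has run its scripts.
    return driver.page_source


def scrape_with_firefox(url):
    # Imported here so rendered_html has no hard dependency on Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        return rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Content that scripts add some time after the page loads may not be present yet when `page_source` is read, so a real scraper would probably need to wait for a specific element first.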

document.write is usually used because the content is generated on the fly, often by fetching data from a server. What you get are web apps that are more about JavaScript than HTML. "Scraping" is really a matter of downloading HTML and processing it, but here there isn't any HTML to download. You are essentially trying to scrape a GUI program.

Most of these applications have some sort of API, often returning XML or JSON data, that you can use instead. If one doesn't, you should probably try to remote-control a real web browser instead.
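Since such an API may return either format, here is a small sketch of turning a response body into Python data. The function name `decode_api_body` is made up, and the XML branch assumes a flat layout (a root element whose children each hold simple text fields), which a real API may well not follow.

```python
import json
import xml.etree.ElementTree as ET


def decode_api_body(body, content_type):
    """Turn an API response body (str) into Python data based on its content type."""
    if "json" in content_type:
        return json.loads(body)
    if "xml" in content_type:
        root = ET.fromstring(body)
        # Assumed layout: each child of the root becomes a {tag: text} dict.
        return [{field.tag: field.text for field in item} for item in root]
    raise ValueError("unsupported content type: " + content_type)
```

The content type would come from the response's Content-Type header; anything that is neither JSON nor XML raises `ValueError` rather than being guessed at.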


 