

Scraping a web page with JavaScript in Python

I'm working in Python 3.2 (newbie) on a Windows machine (though I have Ubuntu 10.04 in VirtualBox if needed, I'd prefer to work on the Windows machine).

Basically I'm able to use the http and urllib modules to scrape web pages, but only pages that don't use JavaScript such as document.write("<div....") and the like, which add data that isn't there when I fetch the raw page (meaning pages without real AJAX scripts).

To handle those kinds of sites as well, I'm pretty sure I need a browser JavaScript engine to execute the page's scripts and give me the final result, hopefully as a dict or text.

I tried to compile python-spidermonkey, but I understand it doesn't support Windows and doesn't work with Python 3.x.

Any suggestions? If anyone has done something like this before, I'd appreciate the help!

I recommend Python's bindings to the WebKit library (here is an example). WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.

Use Firebug to see exactly what is being called to get the data to display (a POST or GET URL?). I suspect there's an AJAX call that retrieves the data from the server as either XML or JSON. Just make the same AJAX call yourself, and parse the data on your own.
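The idea above can be sketched with the standard library alone. The endpoint URL and the response shape here are hypothetical stand-ins for whatever Firebug's Net panel actually shows you; the parsing step is demonstrated offline on a sample payload of the same shape, so nothing below depends on a live server.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical AJAX endpoint discovered via Firebug's Net panel
AJAX_URL = "http://example.com/api/data?id=42"

def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def parse_items(payload):
    """Pull the interesting fields out of a decoded response (made-up schema)."""
    return [(item["name"], item["price"]) for item in payload.get("items", [])]

# Offline demonstration: a sample payload shaped like a typical JSON response
sample = '{"items": [{"name": "foo", "price": 10}, {"name": "bar", "price": 20}]}'
print(parse_items(json.loads(sample)))  # [('foo', 10), ('bar', 20)]
```

Once you know the real URL and schema, `fetch_json` plus a `parse_items` tailored to that schema replaces the whole browser round trip.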

Optionally, you can download Selenium for Firefox, start a Selenium server, load the page via Selenium, and read out the DOM contents. MozRepl works as well, but doesn't have as much documentation since it isn't widely used.

document.write is usually used because the content is generated on the fly, often by fetching data from a server. What you get are web apps that are more about JavaScript than HTML. "Scraping" is really a matter of downloading HTML and processing it, but here there isn't any HTML to download. You are essentially trying to scrape a GUI program.

Most of these applications have some sort of API, often returning XML or JSON data, that you can use instead. If there isn't one, you should probably try to remote-control a real web browser instead.
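For the XML case, the standard library's xml.etree.ElementTree is enough once you have the response body. The element and attribute names below are made up for illustration; substitute whatever the real API returns.

```python
import xml.etree.ElementTree as ET

# Sample body shaped like a typical XML API response (schema is made up)
body = "<items><item name='foo' price='10'/><item name='bar' price='20'/></items>"

root = ET.fromstring(body)
items = [(item.get("name"), float(item.get("price")))
         for item in root.findall("item")]
print(items)  # [('foo', 10.0), ('bar', 20.0)]
```

Compared with driving a browser, parsing the API response directly is faster and far less fragile, so it's worth checking for an API before reaching for Selenium or WebKit.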


 