简体繁体 English

Python搜寻器。解析和执行ajax

[英]Python crawler. Parsing and executing ajax

原文 2012-01-25 18:57:57 9 1 python/ ajax/ web-scraping/ web-crawler

I have a basic structure set up for a crawler. 我为爬虫设置了基本结构。 Now I released it on some php driven websites and it works like a charm. 现在，我在一些php驱动的网站上发布了它，它的工作原理很吸引人。 Though now I want to it to build datasheets from ajax content. 尽管现在我希望它可以从ajax内容构建数据表。

At the moment I am using Mechanize for PYTHON and perl to build my crawler. 目前，我正在使用Mechanize for PYTHON和perl来构建我的搜寻器。 Though Mechanize module does not execute AJAX. 虽然机械化模块不执行AJAX。 How do i get to the content that is build by asynchronomous ajax? 我如何获得异步ajax构建的内容？

I know there is something called Selenium, a real browser to automate. 我知道有一个叫做Selenium的东西，一个真正的浏览器可以自动化。 But is this my only option? 但这是我唯一的选择吗？

1 个解决方案

Your can run a headless browser eg phantomjs which understands JavaScript, DOM etc but you will have to write your code in Javascript, benefit is that you can do whatever you want. 您可以运行无头浏览器（例如phantomjs），该浏览器可以理解JavaScript，DOM等，但是您必须使用Javascript编写代码，其好处是您可以做任何想做的事情。

There is another way but its messy . 还有另一种方法，但是它很messy 。

You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). 单击按钮（使用Firefox中的Firebug或Chrome中的Developer Tools）时，您可以观察到发出了哪些请求。 Than try to reverse engineer the javascript running behind the page and try to do the similar thing using your python code, for that take a look at Spidermonkey 比尝试对页面后面运行的javascript进行反向工程，并尝试使用您的python代码做类似的事情，为此，请看一下Spidermonkey