
Scraping websites with Javascript enabled?

I'm trying to scrape and submit information to websites that heavily rely on Javascript to do most of their actions. The websites won't even work when I disable Javascript in my browser.

I've searched for some solutions on Google and SO, and there was someone who suggested I should reverse engineer the Javascript, but I have no idea how to do that.

So far I've been using Mechanize, and it works on websites that don't require Javascript.

Is there any way to access websites that use Javascript by using urllib2 or something similar? I'm also willing to learn Javascript, if that's what it takes.

I wrote a small tutorial on this subject; this might help:

http://koaning.io.s3-website.eu-west-2.amazonaws.com/dynamic-scraping-with-python.html

Basically, you have the selenium library pretend to be a Firefox browser; the browser will wait until all Javascript has loaded before it passes you the HTML string. Once you have this string, you can parse it with BeautifulSoup.
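A minimal sketch of that flow, assuming a modern selenium (4.x) with a local Firefox/geckodriver install; the parsing helper works on any HTML string, whether it came from a rendered page or not:

```python
# Sketch: render a JS-heavy page with selenium, then parse with BeautifulSoup.
# Assumes selenium 4.x and Firefox/geckodriver are installed locally.
import bs4


def visible_text_links(html):
    """Extract (link text, href) pairs from an HTML string."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]


def render_with_firefox(url):
    """Load a page in headless Firefox and return the DOM after Javascript ran."""
    from selenium import webdriver

    opts = webdriver.FirefoxOptions()
    opts.add_argument("-headless")  # no window needed, e.g. on a server
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(url)  # blocks until the page (and its scripts) have loaded
        return driver.page_source
    finally:
        driver.quit()
```

You would call `visible_text_links(render_with_firefox(url))`; `driver.page_source` is the serialized DOM, so links inserted by Javascript show up where a plain urllib2 fetch would miss them.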

You should look into using Ghost, a Python library that wraps the PyQt4 + WebKit hack.

This makes g the WebKit client:

import ghost
g = ghost.Ghost()

You can grab a page with g.open(url), and then g.content will evaluate to the document in its current state.

Ghost has other cool features, like injecting JS and some form-filling methods, and you can pass the resulting document to BeautifulSoup and so on: soup = bs4.BeautifulSoup(g.content).
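Putting the pieces above together, a rough sketch might look like this. Note the hedge: ghost.py's API has changed between versions and the project is old, so open()/content here follow the usage shown above; the parsing helper itself only needs an HTML string.

```python
# Sketch: fetch a JS-rendered page with Ghost, then parse it with BeautifulSoup.
# The Ghost calls (open, content) follow the answer above; ghost.py requires
# PyQt4/PySide, so the import is kept inside the fetch function.
import bs4


def link_texts(html):
    """Return the text of every <a> element in an HTML string."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [a.get_text() for a in soup.find_all("a")]


def fetch_rendered(url):
    """Load a page in Ghost's WebKit client and return the rendered HTML."""
    import ghost

    g = ghost.Ghost()
    g.open(url)       # loads the page and executes its Javascript
    return g.content  # the document in its current (post-JS) state
```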

So far, Ghost is the only thing I've found that makes this kind of thing easy in Python. The only limitation I've come across is that you can't easily create more than one instance of the client object, ghost.Ghost, but you could work around that.

I've had exactly the same problem. It is not simple at all, but I finally found a great solution, using PyQt4.QtWebKit.

You will find the explanations on this webpage: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/

I've tested it, I currently use it, and it's great!

Its great advantage is that it can run on a server, using only X, without a graphical environment.

Check out crowbar. I haven't had any experience with it, but I was curious about the answer to your question, so I started googling around. I'd like to know if this works out for you.

http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

Maybe you could use Selenium Webdriver, which has Python bindings, I believe. I think it's mainly used as a tool for testing websites, but I guess it should be usable for scraping too.

I would actually suggest using Selenium. It's mainly designed for testing web applications from a "user perspective"; however, it is basically a "Firefox" driver. I've actually used it for this purpose, although I was scraping a dynamic AJAX webpage. As long as the Javascript form has a recognizable "anchor text" that Selenium can "click", everything should sort itself out.

Hope that helps.

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.
