
Web scraping a website with dynamic JavaScript content

Original post: 2014-03-28 14:03:50 | Tags: javascript, python, web-scraping, beautifulsoup, html-parsing

I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, I don't get the entire page, because some of it is generated by JavaScript. Is there any way to get around this?

1 Answer

There are basically two main options:

1. Using the browser developer tools, see which AJAX requests the page makes to load its content, and simulate them in your script. You will probably need the json module to load the JSON response string into a Python data structure.
2. Use a tool like Selenium that opens up a real browser. The browser can also be "headless"; see "Headless Selenium Testing with Python and PhantomJS".

The first option is more difficult to implement and, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
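A minimal sketch of the first option, assuming Python 3 and only the standard library. The endpoint URL, the request headers, and the shape of the JSON payload are all hypothetical; the real values would come from whatever shows up in the browser's Network tab for the site in question.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the XHR URL seen in the browser's Network tab.
AJAX_URL = "https://example.com/api/items?page=1"


def fetch_text(url, timeout=10):
    """Request the URL roughly the way the page's own JavaScript would."""
    request = Request(url, headers={
        # Some servers only answer AJAX-style requests; these headers are an assumption.
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
    })
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def extract_names(json_text):
    """Load the JSON string into Python objects and pull out one field.

    The {"items": [{"name": ...}]} layout is an assumed example, not a real API.
    """
    data = json.loads(json_text)
    return [item["name"] for item in data.get("items", [])]
```

Calling `extract_names(fetch_text(AJAX_URL))` would then return a plain Python list. Keeping the parsing separate from the download makes it easier to adjust when the site changes its response format, which is the fragility mentioned above.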

The second option is better in that you get what any real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page, so you may not need BeautifulSoup at all. However, this option is slower than the first one.
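A rough sketch of the second option, assuming Selenium 4 with a local Chrome install. It uses headless Chrome rather than PhantomJS, since PhantomJS support was dropped from recent Selenium releases. The function name, the URL, and the CSS selector used to decide that the page has finished rendering are hypothetical placeholders.

```python
def fetch_rendered_html(url, ready_css, timeout=10):
    """Load a page in headless Chrome and return the HTML after JavaScript has run.

    ready_css is a CSS selector for an element that only exists once the
    dynamic content has been rendered (an assumption about the target page).
    """
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript-generated element appears, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

For example, `fetch_rendered_html("https://example.com/listing", "div.results")` would return the fully rendered markup, which could be passed to BeautifulSoup or queried directly through Selenium's own element lookups before the driver is closed.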