Web scraping a website with dynamic JavaScript content
So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?
There are basically two main options to proceed with:
The first option is harder to implement and, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in that you get exactly what any real user gets, so you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page, so you may not need BeautifulSoup at all. Either way, this option is slower than the first one.
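To make the second option a bit more concrete, here is a minimal sketch, not a definitive implementation. It assumes Selenium is installed and that Firefox with its driver is available. The function names `fetch_rendered_html` and `has_element_with_id`, and the `items` element id in the usage note, are made up for illustration. The small standard-library helper only shows the symptom from the question: the element is missing from the HTML that urllib returns, but present once a browser has run the page's JavaScript.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Record whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html, element_id):
    """Return True if the HTML string contains an element with this id."""
    finder = IdFinder(element_id)
    finder.feed(html)
    finder.close()
    return finder.found


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in a real browser, wait until css_selector matches an
    element, and return the HTML after JavaScript has run.

    Assumption: Firefox and its driver are installed; any other browser
    supported by Selenium could be used instead.
    """
    # Imported here so the helper above still works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the JavaScript-generated element shows up,
        # or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Something like `html = fetch_rendered_html(url, "#items")` would then return markup for which `has_element_with_id(html, "items")` is true, and that string can be handed to BeautifulSoup as usual if you still prefer it for parsing.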
Hope that helps.