How do I scrape data generated with JavaScript using BeautifulSoup?

Original post: 2018-01-23 01:54:07 | Tags: javascript, python, json, web-scraping, beautifulsoup

I'm trying to migrate some comments from a blog using web scraping with Python and BeautifulSoup. The content I'm looking for isn't in the HTML itself and seems to be generated by a script tag (which I can't find). I've seen some answers about this, but most of them are specific to a particular problem and I can't figure out how to apply them to my site. I'm just trying to scrape comments from pages like this one:

http://www.themasterpiececards.com/famous-paintings-reviewed/bid/92327/famous-paintings-duccio-s-maesta

I've also tried Selenium, but I'm currently using a Cloud9-based IDE and it doesn't seem to support web drivers.

I apologize if I botched any of the lingo; I'm pretty new to programming. If anyone has any tips, that would be helpful. Thanks!

1 answer

There are several ways to scrape such content. One is to find out how the comments are loaded on this website. A quick look in the Chromium developer tools shows that the comments for the page you mentioned are loaded via a separate API call.

This approach may not suit you, since you may not be able to generate this URL for every page.
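If the API URL can be worked out for a page, the idea can be sketched roughly as follows. This is only an illustration using the standard library: the endpoint in `COMMENTS_API_URL` and the `comments`, `author` and `text` keys are hypothetical placeholders, not the site's real API, so copy the actual request URL from the Network tab and adjust the keys to whatever the response really contains.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the request URL seen in the
# browser's developer tools (Network tab) while the comments load.
COMMENTS_API_URL = "https://example.com/api/comments?page_id=92327"


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_comments(payload):
    """Return (author, text) pairs from the decoded payload.

    The "comments", "author" and "text" keys are assumptions made for
    this sketch; missing values fall back to an empty string.
    """
    return [
        (item.get("author", ""), item.get("text", ""))
        for item in payload.get("comments", [])
    ]
```

Calling `extract_comments(fetch_json(COMMENTS_API_URL))` would then give a list of pairs to migrate, assuming the endpoint returns JSON at all; some sites return an HTML fragment instead, which would be parsed with BeautifulSoup as usual.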

Another, more reliable way would be to render the JavaScript content using a headless (GUI-less) browser. For ease of implementation I would suggest using Scrapy with Splash. Splash is a JavaScript rendering service, written in Python, that renders most of the content for your requests.
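As a lighter-weight illustration of the rendering approach, the sketch below skips the Scrapy integration and calls Splash's `render.html` HTTP endpoint directly, which returns the page's HTML after its scripts have run. It assumes a Splash instance is already running locally on the default port 8050; the `wait` value of 2 seconds is a guess and may need tuning for the target site.

```python
import urllib.parse
import urllib.request

# Assumption: Splash is running locally on its default port.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def build_render_url(page_url, wait=2.0, endpoint=SPLASH_ENDPOINT):
    """Build the Splash request URL for rendering page_url.

    "url" is the page to render and "wait" is how many seconds Splash
    waits after loading before returning the HTML.
    """
    query = urllib.parse.urlencode({"url": page_url, "wait": wait})
    return f"{endpoint}?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=60):
    """Ask Splash to render the page and return the resulting HTML text."""
    render_url = build_render_url(page_url, wait)
    with urllib.request.urlopen(render_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The string returned by `fetch_rendered_html(...)` can be handed to `BeautifulSoup(html, "html.parser")`, so the rest of an existing BeautifulSoup script would stay the same.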