Scraping website using BeautifulSoup with unchanging URL

Posted 2015-07-06 21:21:07. Tags: python, web-scraping, beautifulsoup

I've done web scraping before, but I'm running into some issues I haven't seen before when trying to scrape from RottenTomatoes/search. The issue is twofold. (I'm waiting for my API request to be 'validated', and unfortunately Rotten Tomatoes doesn't have a list of all movies, ugh.)

There's a "More Movies" link on the bottom right of the page that has to be "clicked" to bring up the movies. As far as I know, Python doesn't have anything that can interact with a link like that... or does it?

Even when the "More Movies" link is clicked, the URL at the top doesn't change, so I can't navigate/iterate through all the pages by URL. This seems like a problem for BeautifulSoup.

Any recommendations/tips?

1 Answer

This is indeed not directly possible with BeautifulSoup, since BeautifulSoup only parses the static HTML it is given and does not execute JavaScript. The content you want to crawl is added to the page via JavaScript, rather than baked into the HTML.

The 'More Movies' button calls a JavaScript function, which probably makes an AJAX call to fetch more movies.
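To make the limitation concrete, here is a minimal sketch using a made-up page (the markup, the ids and the movie names are all invented for illustration, not taken from Rotten Tomatoes). Two entries are baked into the HTML; a third would only be appended by the script once a browser runs it.

```python
from bs4 import BeautifulSoup

# Made-up page: "Movie A" and "Movie B" are in the markup, "Movie C"
# would only be appended by the script after a click in a real browser.
HTML = """
<ul id="movies">
  <li class="movie">Movie A</li>
  <li class="movie">Movie B</li>
</ul>
<a id="more" href="#">More Movies</a>
<script>
  document.getElementById("more").onclick = function () {
    var li = document.createElement("li");
    li.className = "movie";
    li.textContent = "Movie C";
    document.getElementById("movies").appendChild(li);
  };
</script>
"""

# BeautifulSoup parses the markup but never runs the <script> block.
soup = BeautifulSoup(HTML, "html.parser")
titles = [li.get_text(strip=True) for li in soup.find_all("li", class_="movie")]
print(titles)  # ['Movie A', 'Movie B']
```

"Movie C" never shows up, because nothing executes the click handler.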

There are a few scenarios where you can access the 'more movies' easily:

Sometimes the data is already present in the source, but hidden. The JavaScript just makes it visible.
The JavaScript uses an API to load its content. This API URL can then be found in the source code, and you can find what you are looking for by requesting that URL directly.
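For the second scenario, a rough sketch of what requesting such an API directly could look like. Everything specific here is an assumption: the endpoint URL, the `page` parameter and the JSON shape (`{"movies": [{"title": ...}]}`) are hypothetical placeholders, so substitute whatever the page's JavaScript actually requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only.
API_URL = "https://example.com/api/movies"


def build_url(page, base_url=API_URL):
    """Return the URL for one page of results (assumed 'page' parameter)."""
    return f"{base_url}?{urlencode({'page': page})}"


def parse_titles(payload):
    """Pull the titles out of one decoded JSON response (assumed shape)."""
    return [movie["title"] for movie in payload.get("movies", [])]


def fetch_all_titles(max_pages=100):
    """Request page after page until the API returns no more movies."""
    titles = []
    for page in range(1, max_pages + 1):
        request = Request(build_url(page), headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=10) as response:
            payload = json.load(response)
        batch = parse_titles(payload)
        if not batch:
            break  # an empty page means we have reached the end
        titles.extend(batch)
    return titles
```

Since the response is already structured data, no HTML parsing is needed in this case.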

However, neither of the above seems to be the case for Rotten Tomatoes. I only gave it a quick look, so perhaps you should investigate it more thoroughly.

A solution I've used in the past is Selenium. It has a Python library that is easy to use and allows you to automate browser behaviour. This way you can 'automatically' click the load-more button while crawling.
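A minimal sketch of that idea, assuming Selenium 4 with Chrome. The function names and the CSS selectors passed in are hypothetical; you would have to look up the real selectors in the page. The click loop is kept separate from the browser code so it can be read (and tried) on its own.

```python
def click_until_gone(find_button, max_clicks=50):
    """Click the button returned by find_button() until it returns None.

    Returns the number of clicks made. max_clicks guards against a
    button that never disappears.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
    return clicks


def scrape_all_movies(url, button_css, item_css, wait_seconds=5):
    """Open url in Chrome, expand the list fully, return the item texts."""
    # Imported here so click_until_gone() stays usable without Selenium.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)

        def find_button():
            # Wait for the button to be clickable; None once it is gone.
            try:
                return WebDriverWait(driver, wait_seconds).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, button_css))
                )
            except TimeoutException:
                return None

        click_until_gone(find_button)
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, item_css)]
    finally:
        driver.quit()
```

On a real page you may also need to wait for the new items to arrive after each click. If you would rather keep parsing with BeautifulSoup, `driver.page_source` gives you the HTML after the JavaScript has run, which you can feed to it as usual.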

Beware, however: this can be slow and resource-intensive. You can run it headless, which means no browser window is opened and saves some resources.
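As a rough sketch of the headless setup, assuming Selenium 4 with Chrome (the helper names and the window size are my own choices, not anything prescribed by Selenium):

```python
def chrome_arguments(headless=True):
    """Command-line switches to pass to Chrome."""
    args = ["--window-size=1280,800"]
    if headless:
        # "--headless=new" targets recent Chrome releases;
        # older ones use plain "--headless".
        args.append("--headless=new")
    return args


def make_driver(headless=True):
    """Start Chrome through Selenium, without a window when headless."""
    # Imported here so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

While debugging your selectors it helps to call `make_driver(headless=False)` so you can watch what the browser does, then switch back to headless for the real crawl.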