Scraping website using BeautifulSoup with unchanging URL
I've web-scraped before, but I'm running into some issues I haven't seen when trying to scrape RottenTomatoes/search. The issue is twofold. (I'm waiting for my API request to be validated, and unfortunately Rotten Tomatoes doesn't have a list of all movies, ugh.)

Any recommendations/tips?
This isn't directly possible with BeautifulSoup, since BeautifulSoup only handles static webpages. The content you want to crawl is added to the page via JavaScript, rather than baked into the HTML.
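You can see the problem in miniature below: BeautifulSoup only parses the HTML the server actually sends, so a container that JavaScript would fill after page load looks empty to it. The markup and class name here are hypothetical, not Rotten Tomatoes' real structure.

```python
from bs4 import BeautifulSoup

# What the server might send: an empty container that a script fills later.
served_html = """
<div id="search-results">
  <!-- results inserted by JavaScript after page load -->
</div>
"""

soup = BeautifulSoup(served_html, "html.parser")
container = soup.find(id="search-results")

# "movie-link" is a made-up class name for illustration.
movies = container.find_all("a", class_="movie-link")
print(len(movies))  # 0 -- the JavaScript-added links never reach BeautifulSoup
```

No matter how many results the rendered page shows in a browser, the parser only ever sees that empty `div`.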
The 'More movies' button calls a JavaScript function that most likely makes an AJAX request for more movies. There are a few scenarios in which you can access the 'more movies' easily, but none of them seems to apply to Rotten Tomatoes. I gave it a quick look; perhaps you should investigate it more thoroughly.
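If you do find the underlying AJAX request in your browser's network tab, the cheapest route is to call that endpoint directly and parse its JSON instead of scraping HTML. The response shape below is purely hypothetical, just to show the pattern of walking paginated results:

```python
import json

# Hypothetical example of what one "More movies" AJAX response might look
# like -- inspect the Network tab to find the real endpoint and shape.
sample_response = json.loads("""
{
  "results": [
    {"title": "Movie A", "year": 2019},
    {"title": "Movie B", "year": 2020}
  ],
  "nextPage": 2
}
""")

titles = [movie["title"] for movie in sample_response["results"]]
print(titles)                        # ['Movie A', 'Movie B']
print(sample_response["nextPage"])   # keep requesting pages until this is null
```

In practice you would fetch each page with `requests`, collect `results`, and follow the pagination field until it runs out.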
A solution I've used in the past is Selenium. It has an easy-to-use Python library that lets you automate browser behaviour, so you can 'automatically' click the load-more button while crawling.
Beware, however: this might be slow and cost resources. You can run it headless, which keeps it from opening a visible browser window and saves some of those resources.