Scraping a website whose URL doesn't change when clicking the "next page" button

Original post: 2021-10-14 00:02:42 | 4 1 | python, selenium, web-scraping

I'm trying to scrape a BBC website:

https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests

I would like to get all the news articles, but the URL doesn't change when I click the next-page button, so I can only get the information on the first page. Can anyone help? I'm using Selenium, but I'm familiar with requests too. Thanks!

1 answer

Open the developer console in your browser, go to the Network tab, and disable the cache. You can then see an API request being made for each page change. You don't need Selenium; you can just use requests or aiohttp.

This is an example: https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2Fd5803bfc-472d-4abf-b334-d3fc4aa8ebf9%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F2%2Fversion%2F1.5.6?timeout=5

Type "batch" in the filter bar and you should see only the API calls that I believe are responsible for the page change.
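The example URL carries the page in its `pageNumber` segment, so a rough sketch of paging through the API could look like the following. It rebuilds the same URL for any page and fetches it with the standard library (`requests.get(url, timeout=10).json()` would work equally well). The helper names `build_batch_url` and `fetch_page` and the User-Agent header are assumptions for illustration, and the endpoint, version string, and response layout may have changed since the answer was written, so inspect the returned JSON before extracting articles.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BATCH_ENDPOINT = "https://push.api.bbci.co.uk/batch"


def build_batch_url(about_id, page_number, limit=20):
    """Rebuild the example batch URL for a given topic id and page number."""
    # Path segments are copied from the example URL; only about_id, limit and
    # page_number are parameterised (an assumption about what may vary).
    path = (
        "/data/bbc-morph-lx-commentary-data-paged"
        f"/about/{about_id}/isUk/false/limit/{limit}"
        f"/nitroKey/lx-nitro/pageNumber/{page_number}/version/1.5.6"
    )
    # safe="" makes quote() encode "/" as %2F, as in the example URL.
    # The trailing "?timeout=5" is left unencoded, exactly as in the example.
    return f"{BATCH_ENDPOINT}?t={quote(path, safe='')}?timeout=5"


def fetch_page(about_id, page_number):
    """Fetch one page and return the decoded JSON (hypothetical helper)."""
    request = Request(
        build_batch_url(about_id, page_number),
        # Assumption: a browser-like User-Agent may be required.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `fetch_page("d5803bfc-472d-4abf-b334-d3fc4aa8ebf9", 1)`, then page 2, 3, and so on until a page comes back empty or the request fails, would walk through the topic without a browser.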

You can find this topic's about id (d5803bfc-472d-4abf-b334-d3fc4aa8ebf9) in the webpage source, in this case the source of https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
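The answer doesn't say where in the page source the id sits, so one cautious way to locate it is to list every UUID-shaped string in the downloaded HTML and compare the candidates with the id seen in the batch request. The sketch below does only that; `find_uuid_candidates` is a hypothetical helper, and the assumption that the id appears in the source in plain lowercase UUID form comes from the example id above.

```python
import re

# Matches lowercase UUID-shaped strings such as the about id in the example.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def find_uuid_candidates(html):
    """Return the distinct UUID-shaped strings in html, in order of first appearance."""
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(UUID_RE.findall(html)))
```

If several candidates come back, the one that also appears in the "batch" requests in the Network tab is the id to use.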