Scrape results of multiple search pages with Beautiful Soup and Python

I am learning to use Beautiful Soup to scrape some info from a website. The website has multiple search results pages that I want to scrape.

This is simple, as the URL changes for each page:

website.com/page1
website.com/page2
.
.

But I don't know in advance how many pages there will be. So I don't want to try to scrape website.com/page13 if it doesn't exist, or if website.com/page13 just shows the last results page, which may have been website.com/page9.

Is there a way I can stop scraping when I reach the final results page?

Often search pages have results with some sort of indexing. If the page you are looking at has such indexing, you can stop when you see the same index twice.
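As a rough sketch of the repeated-index idea, assuming each result carries some index or ID you can extract: `fetch_page` below is a hypothetical callable (not part of Beautiful Soup) that downloads page `n`, parses it, and returns a list of `(index, data)` tuples. `max_pages` is just a safety cap I added.

```python
def scrape_until_repeated_index(fetch_page, max_pages=1000):
    """Collect results page by page; stop when a result index repeats.

    fetch_page(n) is a hypothetical callable returning a list of
    (index, data) tuples for page n, e.g. parsed with Beautiful Soup.
    """
    seen = set()
    collected = []
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        indices = [index for index, _ in results]
        # An empty page, or an index we already have, means there is nothing new.
        if not results or any(index in seen for index in indices):
            break
        seen.update(indices)
        collected.extend(results)
    return collected
```

This also covers the case where a URL past the end just serves the last page again, since every index on it has already been seen.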

Additionally, you may find pagination links at the bottom of the results page, and from the page you are on you can tell whether you have reached the end of that pagination list.
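For the pagination check, something like this might work. The markup here is purely an assumption on my part (a `div` with class `pagination` containing an `a` with class `next`); you would need to inspect the real page and adjust the selector.

```python
from bs4 import BeautifulSoup


def has_next_page(html):
    """Return True if the pagination block links to a later page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed markup: <div class="pagination"> ... <a class="next" href="..."> ... </div>
    return soup.select_one("div.pagination a.next[href]") is not None
```

You would call this on each page's HTML after extracting its results, and stop requesting further pages once it returns `False`.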

Furthermore, search pages usually display a fixed number of results per page, so in those cases you can assume that the page you are on is the last one if it suddenly shows fewer results than that.
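The fixed-page-size check could look roughly like this. Again, `fetch_page` is a hypothetical callable returning the list of results parsed from page `n`, and `page_size` is whatever count you observe on a full page.

```python
def scrape_until_short_page(fetch_page, page_size, max_pages=1000):
    """Collect results until a page holds fewer than page_size results."""
    collected = []
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        collected.extend(results)
        # A page with fewer results than usual (including none) is the last one.
        if len(results) < page_size:
            break
    return collected
```

One caveat: if the total number of results is an exact multiple of `page_size`, the last real page is full, so what happens next depends on how the site answers a request past the end. If it repeats the last page instead of returning an empty one, you would want to combine this with one of the repeat checks.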

Another way to detect repeated pages is to keep the first result from the current page and compare it to the first result of the next page. If they are the same, you are done.
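A minimal sketch of the first-result comparison, under the same assumption that a hypothetical `fetch_page(n)` returns the parsed results of page `n` as a list of comparable values (strings, tuples, and so on):

```python
def scrape_until_first_result_repeats(fetch_page, max_pages=1000):
    """Collect results until a page starts with the same result as the previous page."""
    collected = []
    previous_first = object()  # sentinel that never equals a real result
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        # Same first result as the page before: the site is repeating its last page.
        if not results or results[0] == previous_first:
            break
        previous_first = results[0]
        collected.extend(results)
    return collected
```

This handles your website.com/page13 case, where the site keeps serving the final page for any page number past the end.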

If you can give more detail about the page you are trying this on, or about the scope of the problem, I may be able to give additional input.
