
Puppeteer Crawler large scale crawling

We are writing a web crawler using Puppeteer. The crawler executes and crawls website URLs without problems for sites with roughly 1,500 - 5,000 pages. However, when we run it against websites with more than 5,000 pages and it breaks in the middle due to some error or crash, it has to start all over again. How can a Puppeteer-based web crawler be made to resume from its last crawling state if an error occurs? Are there any built-in functions in Puppeteer for this? How can this Puppeteer headless Chrome crawler be driven through a queue system?

You can store the URLs of the pages that you want to scrape in some kind of queue, for example using AWS SQS. You can then run multiple Node.js processes on different servers or containers. These worker jobs fetch links one by one from the common queue and crawl the corresponding page with Puppeteer in headless mode. If further links that should be crawled are found, they can be added to the same queue (links back to already crawled pages need to be avoided, of course, because they lead to infinite loops). The results can be stored in a separate queue or in a database. If a worker process crashes, it can be restarted with one of the usual open-source utilities for managing and monitoring processes. Eventually the original queue will be empty, which indicates that all pages have been crawled. A minimal sketch of such a worker is shown below.
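A minimal sketch of one such worker, assuming an SQS queue whose URL is supplied through a `QUEUE_URL` environment variable and the AWS SDK v3 client; `alreadySeen()` and `saveResult()` are hypothetical placeholders for whatever de-duplication and result stores you use (Redis, DynamoDB, a database, ...):

```js
// worker.js - hypothetical queue-driven Puppeteer worker (a sketch, not the answerer's code).
const puppeteer = require('puppeteer');
const {
  SQSClient,
  ReceiveMessageCommand,
  SendMessageCommand,
  DeleteMessageCommand,
} = require('@aws-sdk/client-sqs');

const QUEUE_URL = process.env.QUEUE_URL;          // assumed to be configured by you
const sqs = new SQSClient({ region: 'us-east-1' });

async function run() {
  const browser = await puppeteer.launch({ headless: true });
  for (;;) {
    // Long-poll the shared queue for one URL at a time.
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,
    }));
    if (!Messages || Messages.length === 0) break;  // queue drained: crawl is finished

    const message = Messages[0];
    const url = message.Body;
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });

      // Collect outgoing links and enqueue the ones we have not seen yet,
      // so the crawl frontier keeps growing from the queue itself.
      const links = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));
      for (const link of new Set(links)) {
        if (await alreadySeen(link)) continue;       // hypothetical dedup store
        await sqs.send(new SendMessageCommand({ QueueUrl: QUEUE_URL, MessageBody: link }));
      }

      await saveResult(url, await page.content());   // hypothetical result store

      // Acknowledge the message only after the page was processed successfully.
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle,
      }));
    } finally {
      await page.close();
    }
  }
  await browser.close();
}

run().catch(err => { console.error(err); process.exit(1); });
```

Because a message is only deleted after its page has been processed, a crashed or restarted worker never loses progress: the unacknowledged URL simply reappears in the queue after the visibility timeout and is picked up again, which is what gives you the "resume from the last state" behaviour.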

I built a crawler myself with Puppeteer.js to crawl Google and Bing, and I struggled with it for a long time. I highly recommend working with forever-monitor to restart the crawler each time the browser crashes or a page call hangs. Second, I highly recommend adding a page.reload when the page doesn't respond for more than 60 seconds (do it with a promise); see the sketches after this paragraph.
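Two hedged sketches of what that can look like, assuming the crawler entry point is a file called crawler.js (the file name and the option values are placeholders, not taken from the original answer). First, a small forever-monitor supervisor that restarts the crawler process whenever it exits:

```js
// supervisor.js - restart crawler.js automatically when it crashes or hangs and exits.
const forever = require('forever-monitor');

const child = new forever.Monitor('crawler.js', {
  max: 1000,          // maximum number of restarts before giving up
  minUptime: 2000,    // ms the process must stay alive to count as started
  spinSleepTime: 5000 // wait between rapid successive restarts
});

child.on('restart', () => console.log('crawler.js was restarted'));
child.start();
```

Second, a navigation helper that relies on Puppeteer's 60-second timeout (page.goto rejects its promise when the page does not respond in time) and falls back to a single page.reload before letting the error bubble up to the supervisor:

```js
// Hypothetical helper: navigate, and reload once if the page hangs for more than 60 s.
async function gotoWithReload(page, url) {
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });
  } catch (err) {
    console.warn(`No response within 60 s for ${url}, reloading once...`);
    await page.reload({ waitUntil: 'networkidle2', timeout: 60000 });
  }
}
```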
