
Puppeteer Crawler large scale crawling

We are writing a web crawler using Puppeteer. It crawls websites of roughly 1,500 to 5,000 pages without problems. However, on websites with more than 5,000 pages, if the crawl breaks partway through because of an error or crash, it has to start over from the beginning. How can we make a Puppeteer-based web crawler resume from its last crawl state after an error? Does Puppeteer have any built-in functions for this? And how can we run this headless Chrome crawl through a queue system?

You can store the URLs of the pages that you want to scrape in some kind of queue, for example AWS SQS. Then you can run multiple Node.js worker processes on different servers or containers. Each worker fetches links one at a time from the common queue and crawls the corresponding page with Puppeteer in headless mode. If a page contains more links that should be crawled, they are added to the same queue (links back to already crawled pages must be skipped, of course, because they lead to infinite loops). The results can be stored in a different queue or in a database. If a worker process crashes, it can be restarted using one of the usual open source utilities for managing and monitoring processes, and because the remaining URLs live in the queue rather than in the worker's memory, the crawl continues from where it stopped. Eventually the original queue will be empty, which indicates that all pages have been crawled.
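
A minimal sketch of the resume idea for a single worker, assuming a local JSON file stands in for the shared queue (with several workers on different machines, a service such as SQS would take its place). `CrawlState`, `crawl`, `runWithPuppeteer`, and the state file are hypothetical names invented for this illustration; only the Puppeteer calls (`launch`, `newPage`, `goto`, `$$eval`, `close`) are real API.

```javascript
const fs = require('fs');

// Hypothetical helper: crawl state persisted to a JSON file so that a
// restarted process picks up where the previous one stopped.
class CrawlState {
  constructor(file) {
    this.file = file;
    this.pending = [];     // URLs still to crawl
    this.seen = new Set(); // every URL ever queued, used to avoid loops
    if (fs.existsSync(file)) {
      const saved = JSON.parse(fs.readFileSync(file, 'utf8'));
      this.pending = saved.pending;
      this.seen = new Set(saved.seen);
    }
  }

  // Queue a URL unless it was queued before. Returns true if it was added.
  add(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Look at the next URL without removing it, so a crash in the middle of
  // a page leaves that URL in the queue.
  peek() {
    return this.pending[0];
  }

  // Remove a URL once its page has been handled, then persist the state.
  done(url) {
    this.pending = this.pending.filter((u) => u !== url);
    this.save();
  }

  save() {
    const data = JSON.stringify({ pending: this.pending, seen: [...this.seen] });
    // Write to a temp file and rename it, so a crash during the write
    // does not leave a half-written state file behind.
    fs.writeFileSync(this.file + '.tmp', data);
    fs.renameSync(this.file + '.tmp', this.file);
  }
}

// visit(url) is supplied by the caller and resolves to the links found on the page.
async function crawl(state, visit) {
  let url;
  while ((url = state.peek()) !== undefined) {
    const links = await visit(url);
    links.forEach((link) => state.add(link));
    state.done(url);
  }
}

// Example wiring with Puppeteer. The module is required lazily so the
// state handling above can be used without a browser installed.
async function runWithPuppeteer(startUrl, stateFile) {
  const puppeteer = require('puppeteer');
  const origin = new URL(startUrl).origin;
  const state = new CrawlState(stateFile);
  state.add(startUrl); // no-op when resuming, because the URL is already in `seen`
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await crawl(state, async (url) => {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      const links = await page.$$eval('a[href]', (as) => as.map((a) => a.href));
      // Assumption: only links on the start URL's origin should be followed.
      return links.filter((link) => link.startsWith(origin + '/'));
    });
  } finally {
    await browser.close();
  }
}
```

Calling `runWithPuppeteer('https://example.com/', 'crawl-state.json')` a second time after a crash would continue with the URLs still listed in the state file. In this sketch a page that always throws stays at the head of the queue, so a real crawler would probably also want a retry counter or a separate list of failed URLs.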

I built a crawler myself with Puppeteer to crawl Google and Bing, and I struggled with it for a long time. First, I highly recommend using forever-monitor to restart the crawler each time the browser crashes or a page call hangs. Second, I highly recommend calling page.reload when the page doesn't respond for more than 60 seconds (do it with a promise).
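
One way the "reload after 60 seconds, with a promise" advice could look is sketched below. `withTimeout`, `HangError`, and `gotoWithReload` are hypothetical helpers written for this illustration, not part of Puppeteer; the sketch assumes `page` is a Puppeteer `Page` and passes `timeout: 0` to turn off Puppeteer's own navigation timeout so that only the promise-based timer decides when a page counts as hung.

```javascript
// Marker error so a hang can be told apart from other navigation failures.
class HangError extends Error {}

// Reject with HangError if the given promise does not settle within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new HangError(`no response after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Navigate to url; if the page hangs, call page.reload() up to maxReloads times.
async function gotoWithReload(page, url, ms = 60000, maxReloads = 2) {
  let action = () => page.goto(url, { timeout: 0 });
  for (let attempt = 0; ; attempt++) {
    try {
      return await withTimeout(action(), ms);
    } catch (err) {
      // Errors other than a hang, and hangs after the last reload, go to the caller.
      if (!(err instanceof HangError) || attempt >= maxReloads) throw err;
      action = () => page.reload({ timeout: 0 });
    }
  }
}
```

If every reload also hangs, `gotoWithReload` rejects, and the process can then exit and be restarted by a monitor such as forever-monitor.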


 