
Scrapy - Use RabbitMQ only or Celery + RabbitMQ for scraping multiple websites?

I want to run multiple spiders to crawl many different websites. The websites take different amounts of time to scrape (some take about 24 hours, others 4 hours, ...). I have several workers (fewer than the number of websites) to launch Scrapy, and a queue where I put the websites I want to crawl. Once a worker has finished crawling a website, the website goes back into the queue to wait for an available worker, and so on. The problem is that small websites get crawled more often than big ones, and I want all websites to be crawled the same number of times.

I was thinking about using RabbitMQ for queue management and to prioritise some websites. But when I search for RabbitMQ, it often comes up together with Celery. My understanding of these tools is that Celery lets you run code on a schedule, while RabbitMQ uses messages and queues to define the execution order.

In my case, I don't know if using only RabbitMQ without Celery will work. Also, is using RabbitMQ helpful for my problem?

Thanks

Yes, RabbitMQ is very helpful for your use case: your crawling agent can push its results onto a message queue, and a document processor can then store them in both your database back end (in this reply I'll assume MongoDB) and your search engine (I'll assume Elasticsearch here).
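
To make that concrete, here is a minimal sketch of a Scrapy item pipeline that pushes each scraped item onto a RabbitMQ queue using pika. The queue name scraped_items, the localhost broker, and the pipeline class name are assumptions for illustration; a separate consumer process would read from the same queue and write into MongoDB and Elasticsearch.

    import json

    import pika

    # Hypothetical Scrapy item pipeline: publishes every scraped item to a
    # durable RabbitMQ queue for a downstream document processor to consume.
    class RabbitMQPipeline:

        def open_spider(self, spider):
            # One blocking connection per spider process is enough for a sketch;
            # a production pipeline would also handle reconnects and confirms.
            self.connection = pika.BlockingConnection(
                pika.ConnectionParameters(host="localhost")
            )
            self.channel = self.connection.channel()
            self.channel.queue_declare(queue="scraped_items", durable=True)

        def process_item(self, item, spider):
            self.channel.basic_publish(
                exchange="",
                routing_key="scraped_items",
                body=json.dumps(dict(item)),
                properties=pika.BasicProperties(delivery_mode=2),  # persist the message
            )
            return item

        def close_spider(self, spider):
            self.connection.close()

You would enable a pipeline like this through ITEM_PIPELINES in your project's settings.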

What you get in this scenario is a fast, dynamic search engine and a crawler that can be scaled out.

As for the Celery + RabbitMQ + Scrapy portion: Celery is a good way to schedule your Scrapy crawlers and distribute the crawler bots across your infrastructure. Celery simply uses RabbitMQ as its back end to consolidate and distribute jobs between instances. So to use Celery and Scrapy together, write your Scrapy bot so it pushes its results onto its own RabbitMQ queue, then write a document processor that stores those results in your persistent database back end. Then set up Celery to schedule the batches of site crawls; you can also throw in Python's sched module to maintain a bit of sanity in your crawling schedule.
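
A rough sketch of the Celery side, assuming RabbitMQ runs on localhost and this module is saved as tasks.py; the spider names and the once-a-day interval are placeholders. Each crawl is launched as a subprocess because Scrapy's Twisted reactor cannot be restarted inside a long-lived Celery worker.

    import subprocess

    from celery import Celery

    # RabbitMQ acts as the Celery broker; workers on any machine pointed at this
    # broker will pick up crawl jobs.
    app = Celery("crawls", broker="amqp://guest:guest@localhost//")

    @app.task
    def run_spider(spider_name):
        # Run "scrapy crawl <spider_name>" in its own process so the worker is
        # free once the crawl finishes, however long it takes.
        subprocess.run(["scrapy", "crawl", spider_name], check=True)

    # Celery beat re-enqueues every site at the same fixed interval, so small
    # sites are not crawled more often than big ones.
    app.conf.beat_schedule = {
        "crawl-small-site": {
            "task": "tasks.run_spider",
            "schedule": 24 * 60 * 60,  # once every 24 hours
            "args": ("small_site_spider",),
        },
        "crawl-big-site": {
            "task": "tasks.run_spider",
            "schedule": 24 * 60 * 60,
            "args": ("big_site_spider",),
        },
    }

Start the workers with "celery -A tasks worker" and the scheduler with "celery -A tasks beat".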

Also, review the work Google has published on how they avoid over-crawling a site, respect sane robots.txt settings, and your crawler should be good to go.
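
On the robots.txt point, Scrapy already ships with politeness settings you can lean on; an illustrative settings.py snippet (the numeric values are placeholders, not recommendations for any particular site):

    ROBOTSTXT_OBEY = True                # respect each site's robots.txt
    DOWNLOAD_DELAY = 1.0                 # minimum delay in seconds between requests to a domain
    AUTOTHROTTLE_ENABLED = True          # back off automatically when a site responds slowly
    CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per site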
