Running more than one spider, one by one

Question

I am using the Scrapy framework to have spiders crawl some web pages. Basically, I want to scrape web pages and save them to a database. I have one spider per web page. But I am having trouble running those spiders one after another, so that each spider starts crawling only after the previous one has finished. How can that be achieved? Is scrapyd the solution?

Answer 1

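scrapyd is indeed a good way to go. The max_proc or max_proc_per_cpu configuration can be used to restrict the number of parallel spiders; you then schedule the spiders using the scrapyd REST API, for example: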

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
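
The same call can be scripted to queue several spiders in a row. The following is a minimal sketch, not part of the original answer: it assumes scrapyd is listening on localhost:6800 as in the curl command, and that max_proc is set to 1 in the [scrapyd] section of the scrapyd configuration so that only one spider process runs at a time. The helper names build_schedule_request and schedule_spiders are hypothetical, as is any spider name other than somespider.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the curl example; adjust host/port for your setup.
SCRAPYD_URL = "http://localhost:6800/schedule.json"


def build_schedule_request(project, spider, url=SCRAPYD_URL):
    """Build the same form-encoded POST that `curl -d ...` sends."""
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(url, data=body, method="POST")


def schedule_spiders(project, spiders, send=urlopen):
    """Queue each spider in the given order and collect scrapyd's JSON replies.

    `send` is injectable (hypothetical convenience) so the function can be
    exercised without a running scrapyd instance.
    """
    replies = []
    for spider in spiders:
        with send(build_schedule_request(project, spider)) as response:
            replies.append(json.loads(response.read().decode("utf-8")))
    return replies
```

Calling schedule_spiders("myproject", ["somespider", "otherspider"]) would post one scheduling request per spider. With max_proc limited to 1, scrapyd should start the next queued job only once the running one has finished; the exact start order depends on scrapyd's queue, so treat the ordering as an expectation to verify rather than a guarantee.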

Answer 1: accepted, 2014-02-11 06:17:28
