
How to 'pause' a spider in Scrapy?

I'm using Tor (through Privoxy) for a scraping project, and would like to write a Scrapy extension (cf. https://doc.scrapy.org/en/latest/topics/extensions.html ) which requests a new identity (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor ) whenever a certain number of items have been scraped.

However, changing the identity takes some time (a couple of seconds), during which I expect that nothing can be scraped. Therefore, I would like the extension to 'pause' the spider until the IP change has completed.

Is this possible? (I have read some solutions involving Ctrl+C and specifying a JOBDIR, but this seems a bit drastic, as I only want to pause the spider, not stop the entire engine.)
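The "every N items" trigger the question describes can be kept separate from the Tor plumbing. As a minimal sketch (the class name and threshold are illustrative, not part of Scrapy or stem), a small counter can decide when a rotation is due:

```python
class IdentityRotationCounter:
    """Tracks scraped items and reports when an identity change is due.

    `threshold` is the number of items to scrape between identity changes.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def item_scraped(self):
        """Record one scraped item; return True when a rotation is due."""
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0  # reset so the next window starts fresh
            return True
        return False


counter = IdentityRotationCounter(threshold=3)
results = [counter.item_scraped() for _ in range(7)]
# Every third item triggers a rotation:
# [False, False, True, False, False, True, False]
```

In a real extension, `item_scraped` would be hooked up to Scrapy's `item_scraped` signal, and a True result would kick off the identity change.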

The crawler engine has pause() and unpause() methods, so you can try something like this:

class SomeExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(...)  # constructor arguments elided in the original answer
        o.crawler = crawler
        return o

    def change_tor(self):
        self.crawler.engine.pause()
        # some Python code implementing the identity-changing logic
        self.crawler.engine.unpause()
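The snippet above elides the actual identity-change logic. A hedged sketch of the pause-rotate-unpause flow, using a stub engine in place of Scrapy's real one (the StubEngine class and request_new_identity function are illustrative stand-ins, not Scrapy or stem APIs):

```python
class StubEngine:
    """Stand-in for crawler.engine, exposing the same pause/unpause methods."""

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True

    def unpause(self):
        self.paused = False


def request_new_identity():
    # Placeholder for the stem call from the question's FAQ link; a real
    # implementation would signal NEWNYM on the Tor control port and wait
    # a few seconds for the new circuit to become usable.
    return "new-identity"


def change_tor(engine):
    """Pause the engine, rotate the identity, then resume crawling."""
    engine.pause()
    try:
        identity = request_new_identity()  # nothing is fetched while paused
    finally:
        engine.unpause()  # always resume, even if the rotation fails
    return identity


engine = StubEngine()
change_tor(engine)
# engine.paused is False again: crawling resumed after the identity change
```

Wrapping the unpause in `finally` matters: if the identity change raises, the engine would otherwise stay paused and the crawl would hang.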
