Force my scrapy spider to stop crawling

Is there a way to stop crawling when a specific condition is true (e.g. scrap_item_id == predefine_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my Scrapy spider to stop crawling after it discovers the last scraped item.

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.

The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)".

Example as per the docs:

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
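
Applied to the original question, a minimal sketch of the CloseSpider approach could look like the one below. The spider name, URL, CSS selectors, and predefined_value are hypothetical placeholders; only the raise CloseSpider(...) part is the documented mechanism.

from scrapy import Spider
from scrapy.exceptions import CloseSpider

class StopAtItemSpider(Spider):
    name = 'stop_at_item'
    start_urls = ['http://example.com/items']  # placeholder URL
    predefined_value = '12345'                 # hypothetical id of the last item already scraped

    def parse(self, response):
        for row in response.css('.item'):      # hypothetical selector
            scrap_item_id = row.css('::attr(data-id)').get()
            if scrap_item_id == self.predefined_value:
                # last already-scraped item reached: stop the whole crawl
                raise CloseSpider('reached_last_scraped_item')
            yield {'id': scrap_item_id}

Note that CloseSpider shuts the spider down gracefully, so requests already in progress may still finish before the crawl actually stops.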

This question was asked 8 months ago, but I was wondering the same thing and have found another (not great) solution. Hopefully this can help future readers.

I'm connecting to a database in my pipeline file. If the database connection is unsuccessful, I want the spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up using was:

from scrapy.project import crawler
crawler._signal_shutdown(9, 0)  # Run this if the cnxn fails.

This causes the spider to do the following:

[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown.

I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number passed to the function is the signal (for example, using 3,0 instead of 9,0 returns error [scrapy] INFO: Received SIGKILL...

Seems to work well enough though. Happy scraping.
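
Pieced together, a rough sketch of that pipeline could look like the following. The sqlite path is a made-up placeholder, and the scrapy.project singleton import only exists in old (0.12-era) Scrapy versions, so treat this as an illustration of the idea rather than something that runs on current releases.

import sqlite3

from scrapy.project import crawler  # old 0.12-era singleton API; removed in later Scrapy versions

class DatabasePipeline(object):
    def __init__(self):
        try:
            # hypothetical connection; if it fails there is nowhere to send the items
            self.cnxn = sqlite3.connect('/path/to/items.db')
        except Exception:
            crawler._signal_shutdown(9, 0)  # ask the running crawler to shut down

    def process_item(self, item, spider):
        return item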

EDIT: I also suppose that you could just force your program to shut down with something like:

import sys
sys.exit("SHUT DOWN EVERYTHING!")

From a pipeline, I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
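
The same engine call also works from inside a spider callback; here is a minimal sketch where the spider name, URL, and stop condition are hypothetical:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']  # placeholder URL

    def parse(self, response):
        if 'no more results' in response.text:  # hypothetical stop condition
            # close this spider through the engine, as in the pipeline above
            self.crawler.engine.close_spider(self, reason='no_more_results')
            return
        yield {'url': response.url}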

I tried lots of options and nothing worked. This dirty hack does the trick on Linux:

import os
import signal

os.kill(os.getpid(), signal.SIGINT)
os.kill(os.getpid(), signal.SIGINT)

This sends the SIGINT signal to Scrapy twice; the second signal forces the shutdown.
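
For context, here is a sketch of the same hack wrapped in a pipeline, with a made-up stop condition (last_scraped_id is a hypothetical spider attribute):

import os
import signal

class ForceShutdownPipeline(object):
    def process_item(self, item, spider):
        # hypothetical condition: we hit an item that was already scraped in a previous run
        if item.get('id') == getattr(spider, 'last_scraped_id', None):
            spider.logger.info('Last scraped item reached, forcing shutdown')
            os.kill(os.getpid(), signal.SIGINT)  # first SIGINT: graceful shutdown
            os.kill(os.getpid(), signal.SIGINT)  # second SIGINT: force unclean shutdown
        return item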
