Can't get Scrapy Stats from scrapy.CrawlerProcess

I'm running a spider from another script, and I need to retrieve the stats from the Crawler and save them into a variable. I've looked through the docs and other StackOverflow questions, but I haven't been able to solve this.

This is the script I use to run the crawl:

import scrapy
from scrapy.crawler import CrawlerProcess

import spiders  # module that defines MySpider


process = CrawlerProcess({})
process.crawl(spiders.MySpider)
process.start()

stats = CrawlerProcess.stats.getstats() # I need something like this

I expect the stats to contain this kind of data (scrapy.statscollectors):

     {'downloader/request_bytes': 44216,
     'downloader/request_count': 36,
     'downloader/request_method_count/GET': 36,
     'downloader/response_bytes': 1061929,
     'downloader/response_count': 36,
     'downloader/response_status_count/200': 36,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 11, 9, 16, 31, 2, 382546),
     'log_count/DEBUG': 37,
     'log_count/ERROR': 35,
     'log_count/INFO': 9,
     'memusage/max': 62623744,
     'memusage/startup': 62623744,
     'request_depth_max': 1,
     'response_received_count': 36,
     'scheduler/dequeued': 36,
     'scheduler/dequeued/memory': 36,
     'scheduler/enqueued': 36,
     'scheduler/enqueued/memory': 36,
     'start_time': datetime.datetime(2018, 11, 9, 16, 30, 38, 140469)}

I've inspected CrawlerProcess: it returns deferreds, and it removes the crawlers from its 'crawlers' field once the crawl has finished.

Is there a way to work around this?

Best, Peter

According to the docs, CrawlerProcess.crawl accepts either a crawler or a spider class, and you can create a crawler from a spider class via CrawlerProcess.create_crawler.

Thus you can create the crawler instance before starting the crawl process, and retrieve the expected attributes from it after the crawl has finished.

Below is an example, obtained by editing a few lines of your original code:

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # bump a custom stat so there is something to verify afterwards
        self.crawler.stats.inc_value('foo')


process = CrawlerProcess({})
crawler = process.create_crawler(TestSpider)
process.crawl(crawler)
process.start()


stats_obj = crawler.stats
stats_dict = crawler.stats.get_stats()
# perform the actions you want with the stats object or dict
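If a single counter is enough rather than the whole dict, the stats collector also exposes a get_value method; a minimal sketch continuing the example above ('foo' is the custom stat incremented in parse):

 # read individual stats after process.start() has returned;
 # get_value falls back to the given default if the key was never set
 foo_count = crawler.stats.get_value('foo', default=0)
 request_count = crawler.stats.get_value('downloader/request_count')
 print(foo_count, request_count)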

If you want to get the stats in your script via signals, this will help:

 from scrapy import signals
 from scrapy.crawler import CrawlerProcess
 from scrapy.signalmanager import dispatcher


 def spider_results(spider):
     results = []
     stats = []

     def crawler_results(signal, sender, item, response, spider):
         results.append(item)

     def crawler_stats(*args, **kwargs):
         # runs when the spider is closed; the sender is the crawler
         stats.append(kwargs['sender'].stats.get_stats())

     dispatcher.connect(crawler_results, signal=signals.item_scraped)
     dispatcher.connect(crawler_stats, signal=signals.spider_closed)

     process = CrawlerProcess()
     process.crawl(spider)
     process.start()  # the script will block here until the crawling is finished
     return results, stats
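A quick usage sketch, assuming the TestSpider class from the earlier answer is defined or imported in the same script:

 # items holds everything scraped; stats holds one dict per closed spider
 items, stats = spider_results(TestSpider)
 print(stats[0].get('downloader/request_count'))
 print(len(items))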

I hope it helps!
