
Add the spider's name to each line of the log

I'm looking for a way to prefix every log line generated by Scrapy with the name of the spider that produced it. Until now, I launched each spider synchronously in a loop, so it was easy to track which spider generated which log line. But I recently refactored my code to accept a list of spiders as an argument and launch them all at once through CrawlerProcess(). As a result they start asynchronously, and the logs are all mixed together.

I considered adding something like [%(name)s] to the LOG_FORMAT setting, but the name that gets interpolated is the module emitting the message (scrapy.core.engine, scrapy.utils.log, etc.), not the spider's name.

I also tried writing an extension that modifies the crawler's settings by retrieving spider.name and adding it to the LOG_FORMAT constant, but as far as I can tell, changing the settings while the crawler is running has no effect (and I haven't found a clean way to do it anyway, since the settings are immutable).

Any help would be greatly appreciated! Thanks.

  • I tried setting a custom LOG_FORMAT (sketched below), but there seems to be no way to access the spider's name from it;
  • I tried using an extension to capture the crawler's settings and modify them, but they are immutable and are only evaluated when the process starts;
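For reference, the first attempt looked roughly like this (a minimal sketch; %(name)s is the standard logging record attribute, which resolves to the emitting module's logger, not to the spider):

# settings.py - the insufficient attempt, for illustration only.
# %(name)s is the stdlib logger name, which Scrapy sets per module
# (scrapy.core.engine, scrapy.utils.log, ...), not per spider.
LOG_FORMAT = '[%(name)s] %(asctime)s [%(levelname)s]: %(message)s'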

You need to create a custom log format and set it as the project's log formatter.

Basically, you need to extend Scrapy's log formatter and build the messages with the new format.

main2.py:

from scrapy import logformatter
import logging
import os
from twisted.python.failure import Failure
from scrapy.utils.request import referer_str

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


SCRAPEDMSG = "Scraped from %(src)s" + os.linesep + "%(item)s"
# DROPPEDMSG = "Dropped: %(exception)s" + os.linesep + "%(item)s"
CRAWLEDMSG = "Crawled (%(status)s) %(request)s%(request_flags)s (referer: %(referer)s)%(response_flags)s"
# ITEMERRORMSG = "Error processing %(item)s"
# SPIDERERRORMSG = "Spider error processing %(request)s (referer: %(referer)s)"
# DOWNLOADERRORMSG_SHORT = "Error downloading %(request)s"
# DOWNLOADERRORMSG_LONG = "Error downloading %(request)s: %(errmsg)s"


class ExampleLogFormatter(logformatter.LogFormatter):
    def crawled(self, request, response, spider):
        request_flags = f' {str(request.flags)}' if request.flags else ''
        response_flags = f' {str(response.flags)}' if response.flags else ''
        return {
            'level': logging.DEBUG,
            'msg': f'{spider.name} {CRAWLEDMSG}',
            'args': {
                'status': response.status,
                'request': request,
                'request_flags': request_flags,
                'referer': referer_str(request),
                'response_flags': response_flags,
                # backward compatibility with Scrapy logformatter below 1.4 version
                'flags': response_flags
            }
        }

    def scraped(self, item, response, spider):
        if isinstance(response, Failure):
            src = response.getErrorMessage()
        else:
            src = response
        return {
            'level': logging.DEBUG,
            'msg': f'{spider.name} {SCRAPEDMSG}',
            'args': {
                'src': src,
                'item': item,
            }
        }


if __name__ == "__main__":
    spider = 'example_spider'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    settings['LOG_FORMATTER'] = 'tempbuffer.main2.ExampleLogFormatter'
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()
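If you prefer, the formatter can also be enabled declaratively instead of programmatically; a sketch, assuming the class stays importable under the same dotted path as above:

# settings.py - equivalent declarative configuration.
LOG_FORMATTER = 'tempbuffer.main2.ExampleLogFormatter'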

spider.py:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item

Output:

[scrapy.core.engine] DEBUG: example_spider Crawled (200) <GET https://scrapingclub.com/exercise/detail_basic/> (referer: None)
[scrapy.core.scraper] DEBUG: example_spider Scraped from <200 https://scrapingclub.com/exercise/detail_basic/>
{'title': 'Long-sleeved Jersey Top', 'price': '$12.99'}

Update:

A working solution, but not a global one (logging.basicConfig only takes effect once per process, and the spider's name is baked into the format string, so it would have to be repeated in every spider):

import logging
import scrapy
from scrapy.utils.log import configure_logging


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    configure_logging(install_root_handler=False)
    logging.basicConfig(level=logging.DEBUG, format=name + ': %(levelname)s: %(message)s')

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item

Thanks to @SuperUser, I managed to achieve what I needed without having to add code to each spider individually. Everything happens inside an extension, more specifically inside the spider_opened method. Here is the code:

import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CustomLogExtension:

    class ContentFilter(logging.Filter):
        """
        Filter that exposes the spider's name as ``spider_name`` on every
        log record, so it can be referenced from the format string.
        """
        def filter(self, record):
            record.spider_name = ''
            # Scrapy attaches the spider object to most of its log records;
            # copy its name when present.
            if hasattr(record, 'spider'):
                record.spider_name = record.spider.name

            return True

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise NotConfigured otherwise
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        """Prefixes the spider's name to every log emitted."""

        formatter = logging.Formatter('[%(spider_name)s] %(asctime)s [%(name)s] %(levelname)s: %(message)s')
        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(self.ContentFilter())
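For the extension to run at all, it still has to be registered in the project settings. A minimal sketch, assuming the class lives in a module importable as myproject.extensions (adjust the dotted path to your project layout):

# settings.py - registering the extension (module path is an assumption).
EXTENSIONS = {
    'myproject.extensions.CustomLogExtension': 500,
}
# Custom flag checked in from_crawler(); without it the extension
# raises NotConfigured and stays disabled.
CUSTOM_LOG_EXTENSION = True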
