
Suppress Scrapy Item printed in logs after pipeline

I have a scrapy project where the item that ultimately enters my pipeline is relatively large and stores lots of metadata and content. Everything is working properly in my spider and pipelines. The logs, however, are printing out the entire scrapy Item as it leaves the pipeline (I believe):

2013-01-17 18:42:17-0600 [tutorial] DEBUG: processing Pipeline pipeline module
2013-01-17 18:42:17-0600 [tutorial] DEBUG: Scraped from <200 http://www.example.com>
    {'attr1': 'value1',
     'attr2': 'value2',
     'attr3': 'value3',
     ...
     snip
     ...
     'attrN': 'valueN'}
2013-01-17 18:42:18-0600 [tutorial] INFO: Closing spider (finished)

I would rather not have all this data puked into log files if I can avoid it. Any suggestions about how to suppress this output?

Another approach is to override the __repr__ method of the Item subclasses to choose which attributes (if any) get printed at the end of the pipeline:

from scrapy.item import Item, Field
class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        """only print out attr1 after exiting the Pipeline"""
        # Item fields must be accessed with item["field"], not attribute access
        return repr({"attr1": self["attr1"]})

This way, you can keep the log level at DEBUG and show only the attributes that you want to see coming out of the pipeline (to check attr1, for example).
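With that override in place, the scraped-item DEBUG line would look something like this (illustrative, following the log format shown in the question):

2013-01-17 18:42:17-0600 [tutorial] DEBUG: Scraped from <200 http://www.example.com>
    {'attr1': 'value1'}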

Having read through the documentation and conducted a (brief) search through the source code, I can't see a straightforward way of achieving this aim.

The hammer approach is to set the logging level in the settings to INFO (i.e. add the following line to settings.py):

LOG_LEVEL='INFO'

This will strip out a lot of other information about the URLs/pages being crawled, but it will definitely suppress data about processed items.

I tried the __repr__ way mentioned by @dino, but it didn't work well for me. Building on his idea, I tried the __str__ method instead, and it works.

Here's how I do it, very simple:

    def __str__(self):
        return ""

If you want to exclude only some attributes of the output, you can extend the answer given by @dino:

from scrapy.item import Item, Field
import json

class MyItem(Item):
    attr1 = Field()
    attr2 = Field()
    attr1ToExclude = Field()
    attr2ToExclude = Field()
    # ...
    attrN = Field()

    def __repr__(self):
        r = {}
        # Item is dict-like, so iterate its stored field values directly
        for attr, value in self.items():
            if attr not in ['attr1ToExclude', 'attr2ToExclude']:
                r[attr] = value
        return json.dumps(r, sort_keys=True, indent=4, separators=(',', ': '))

If you found your way here because you had the same question years later, the easiest way to do this is with a LogFormatter:

import scrapy.logformatter

class QuietLogFormatter(scrapy.logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        return (
            super().scraped(item, response, spider)
            if spider.settings.getbool("LOG_SCRAPED_ITEMS")
            else None
        )

Just add LOG_FORMATTER = "path.to.QuietLogFormatter" to your settings.py and you will see all your DEBUG messages except for the scraped items. With LOG_SCRAPED_ITEMS = True you can restore the previous behaviour without having to change your LOG_FORMATTER.
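For reference, the relevant settings.py lines would look roughly like this (the module path is a placeholder for wherever you define the formatter):

# settings.py
LOG_FORMATTER = "myproject.logformatters.QuietLogFormatter"  # placeholder path
LOG_SCRAPED_ITEMS = False  # flip to True to get the default item logging back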

Similarly, you can customise the logging behaviour for crawled pages and dropped items.
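For example, a minimal sketch applying the same pattern to the crawled and dropped hooks of LogFormatter (the setting names LOG_CRAWLED_PAGES and LOG_DROPPED_ITEMS are made up here to mirror LOG_SCRAPED_ITEMS above):

import scrapy.logformatter

class QuieterLogFormatter(scrapy.logformatter.LogFormatter):
    def crawled(self, request, response, spider):
        # Silence the "Crawled (200) <GET ...>" lines unless explicitly enabled
        return (
            super().crawled(request, response, spider)
            if spider.settings.getbool("LOG_CRAWLED_PAGES")
            else None
        )

    def dropped(self, item, exception, response, spider):
        # Silence the "Dropped: ..." lines unless explicitly enabled
        return (
            super().dropped(item, exception, response, spider)
            if spider.settings.getbool("LOG_DROPPED_ITEMS")
            else None
        )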

Edit: I wrapped up this formatter and some other Scrapy stuff in this library.

Or, if you know that the spider is working correctly, you can disable logging entirely:

LOG_ENABLED = False

I disable it when my crawler runs fine.

I think the cleanest way to do this is to add a filter to the scrapy.core.scraper logger that changes the message in question. This allows you to keep your Item's __repr__ intact and to not have to change scrapy's logging level:

import logging
import re

class ItemMessageFilter(logging.Filter):
    def filter(self, record):
        # The message that logs the item actually has raw % operators in it,
        # which Scrapy presumably formats later on
        match = re.search(r'(Scraped from %\(src\)s)\n%\(item\)s', record.msg)
        if match:
            # Make the message everything but the item itself
            record.msg = match.group(1)
        # Don't actually want to filter out this record, so always return 1
        return 1

logging.getLogger('scrapy.core.scraper').addFilter(ItemMessageFilter())

We use the following sample in production:

import logging

logging.getLogger('scrapy.core.scraper').addFilter(
    lambda x: not x.getMessage().startswith('Scraped from'))

This is very simple, working code. We add it to the __init__.py of the module that contains our spiders, so it runs automatically for all spiders whenever a command like scrapy crawl <spider_name> is executed.
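For completeness, a sketch of what that __init__.py might contain (the file path in the comment is illustrative; the filter is the one shown above):

# myproject/spiders/__init__.py
import logging

# Drop the "Scraped from <response>" item dumps for every spider in this package
logging.getLogger('scrapy.core.scraper').addFilter(
    lambda record: not record.getMessage().startswith('Scraped from'))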
