
Dupefilter in Scrapy-Redis not working as expected

I'm interested in using Scrapy-Redis to store scraped items in Redis. In particular, the Redis-based request duplicates filter seems like a useful feature.

To start off, I adapted the spider at https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider as follows:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    custom_settings = {'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
                       'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
                       'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}}

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

where I generated the project using scrapy startproject tutorial at the command line and defined QuoteItem in items.py as

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Basically, I've taken the settings from the "Usage" section of the Scrapy-Redis README and applied them per-spider via custom_settings, and I've made the spider yield an Item object instead of a plain Python dictionary (I figured this would be necessary to trigger the item pipeline).
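For reference, the same settings could equally go in the project-wide settings.py instead of custom_settings. A minimal sketch following the README (the REDIS_* lines are optional; I'm assuming a local Redis instance here):

SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Optional: where to find Redis (if omitted, the connection falls back to localhost:6379).
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379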

Now, if I run the spider with scrapy crawl quotes from the command line and then open redis-cli, I see a quotes:items key:

127.0.0.1:6379> keys *
1) "quotes:items"

which is a list of length 20:

127.0.0.1:6379> llen quotes:items
(integer) 20
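(As an aside, individual entries can be inspected with lrange; as far as I can tell, RedisPipeline serializes each item to JSON before pushing it onto the list. The output below is just a placeholder to show the shape, not copied from my run.)

127.0.0.1:6379> lrange quotes:items 0 0
1) "{ ...JSON-serialized item... }"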

If I run scrapy crawl quotes again, the length of the list doubles to 40:

127.0.0.1:6379> llen quotes:items
(integer) 40

However, I would expect the length of quotes:items to still be 20, since I have simply re-scraped the same pages. Am I doing something wrong here?

Scrapy-redis doesn't filter duplicate items automatically.

The (request) dupefilter filters duplicate requests within a crawl. What you want seems to be something similar to the deltafetch middleware: https://github.com/scrapy-plugins/scrapy-deltafetch

You would need to adapt deltafetch to work with distributed storage; perhaps Redis' bitmap feature would fit this case.
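For illustration only (this is not something scrapy-redis ships with), an item-level filter could be a small pipeline that records a fingerprint of every item in a Redis set and drops items it has already seen. The class name, set key and Redis URL below are made up, and I'm using a plain Redis set rather than a bitmap to keep the sketch short:

import hashlib
import json

import redis
from scrapy.exceptions import DropItem


class RedisItemDedupPipeline(object):
    """Drop items whose content has been seen in this or a previous crawl."""

    def __init__(self, redis_url='redis://localhost:6379', key='quotes:items_seen'):
        self.server = redis.StrictRedis.from_url(redis_url)
        self.key = key

    def process_item(self, item, spider):
        # Fingerprint the item by its serialized content.
        fingerprint = hashlib.sha1(
            json.dumps(dict(item), sort_keys=True).encode('utf-8')
        ).hexdigest()
        # SADD returns 0 if the member was already in the set.
        if self.server.sadd(self.key, fingerprint) == 0:
            raise DropItem("Duplicate item: %r" % item)
        return item

Such a pipeline would be registered in ITEM_PIPELINES with a priority below 300 so that it runs before RedisPipeline.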

Here is how I fixed the problem in the end. First of all, as pointed out to me in a separate question (How to implement a custom dupefilter in Scrapy?), using the start_urls class variable results in an implementation of start_requests in which the yielded Request objects have dont_filter=True. To disable this and use the default dont_filter=False instead, I implemented start_requests directly:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
                       'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
                       'DUPEFILTER_CLASS': 'tutorial.dupefilter.RedisDupeFilter',
                       'ITEM_PIPELINES': {'scrapy_redis.pipelines.RedisPipeline': 300}
                       }

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

Secondly, as pointed out by Rolando, the fingerprints aren't persisted across different crawls by default. To implement persistence, I subclassed Scrapy-Redis' RFPDupeFilter class:

import scrapy_redis.dupefilter
from scrapy_redis.connection import get_redis_from_settings


class RedisDupeFilter(scrapy_redis.dupefilter.RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        server = get_redis_from_settings(settings)
        key = "URLs_seen"                               # Use a fixed key instead of one containing a timestamp
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server=server, key=key, debug=debug)

    def request_seen(self, request):
        added = self.server.sadd(self.key, request.url)
        return added == 0

    def clear(self):
        pass                                            # Don't delete the key from Redis

The main differences are that (1) the key is set to a fixed value (not one containing a timestamp) and (2) the clear method, which in Scrapy-Redis' implementation deletes the key from Redis, is effectively disabled.
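(For context, the timestamp in the default key comes from Scrapy-Redis' DUPEFILTER_KEY setting which, as far as I can tell, defaults to something like the following, so each crawl gets a fresh fingerprint set unless the key is pinned as above:)

# Scrapy-Redis default (paraphrased); each run fills in a new timestamp.
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'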

Now, when I run scrapy crawl quotes the second time, I see the expected log output

2017-05-05 15:13:46 [scrapy_redis.dupefilter] DEBUG: Filtered duplicate request <GET http://quotes.toscrape.com/page/1/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

and no items are scraped.
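The URLs recorded by the dupefilter can also be double-checked in redis-cli (the output order may vary):

127.0.0.1:6379> smembers URLs_seen
1) "http://quotes.toscrape.com/page/1/"
2) "http://quotes.toscrape.com/page/2/"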
