
Scraping a domain for links recursively using Scrapy

Here is the code I'm using to scrape all the URLs of a domain:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True),)

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print(link.url)

            yield scrapy.Request(link.url, callback=self.parse)

As you can see, I've used unique=True, but the spider still prints duplicate URLs in the terminal, whereas I want each URL to be printed only once.

Any help on this would be appreciated.

Since the code follows links recursively, the same URL gets extracted again while parsing other pages. unique=True only deduplicates links within a single response; every call to parse() creates a new LxmlLinkExtractor instance, so a URL already seen on one page is printed again when it shows up on another. (Scrapy's built-in duplicate request filter still prevents the duplicate requests from being crawled twice, but your print runs before that filter is applied.)
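One way to avoid the repeated output is to keep a single set of seen URLs on the spider itself, so deduplication persists across responses rather than per extractor. This is a minimal sketch, not the original poster's code; the spider name and the attribute seen_urls are placeholders mirroring the question's setup:

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UniqueUrlsSpider(scrapy.Spider):
    name = 'uniqueurlsspider'          # hypothetical name
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Spider-level set: survives across responses, unlike the
        # per-response deduplication that unique=True gives you.
        self.seen_urls = set()

    def parse(self, response):
        extractor = LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True)
        for link in extractor.extract_links(response):
            if link.url in self.seen_urls:
                continue  # already printed/scheduled from another page
            self.seen_urls.add(link.url)
            print(link.url)
            yield scrapy.Request(link.url, callback=self.parse)

Scrapy will still apply its own duplicate request filter when scheduling, so the set only matters for what you print (or otherwise process) before yielding the request.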

