
Scraping a domain for links recursively using Scrapy

Here is the code I'm using to scrape all the URLs of a domain:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True),)

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print(link.url)

            yield scrapy.Request(link.url, callback=self.parse)

As you can see, I've used unique=True, but the spider still prints duplicate URLs in the terminal, whereas I want each URL to be printed only once.

Any help on this would be appreciated.

Since the code follows links recursively, the same URL gets extracted again while parsing other pages. unique=True only deduplicates links within a single response; every call to parse() creates a new LxmlLinkExtractor instance, so a URL already seen on one page is printed again when it shows up on another. (Scrapy's built-in duplicate request filter still prevents the duplicate requests from being crawled twice, but your print runs before that filter is applied.)
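One way to avoid the repeated output is to keep a single set of seen URLs on the spider itself, so deduplication persists across responses rather than per extractor. This is a minimal sketch, not the original poster's code; the spider name and the attribute seen_urls are placeholders mirroring the question's setup:

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UniqueUrlsSpider(scrapy.Spider):
    name = 'uniqueurlsspider'          # hypothetical name
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Spider-level set: survives across responses, unlike the
        # per-response deduplication that unique=True gives you.
        self.seen_urls = set()

    def parse(self, response):
        extractor = LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True)
        for link in extractor.extract_links(response):
            if link.url in self.seen_urls:
                continue  # already printed/scheduled from another page
            self.seen_urls.add(link.url)
            print(link.url)
            yield scrapy.Request(link.url, callback=self.parse)

Scrapy will still apply its own duplicate request filter when scheduling, so the set only matters for what you print (or otherwise process) before yielding the request.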

