
Scraping a domain for links recursively using Scrapy

Here is the code I'm using for scraping all the URLs of a domain:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # rules must be an iterable (note the trailing comma) and are only applied by CrawlSpider subclasses
    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True),)

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print(link.url)

            yield scrapy.Request(link.url, callback=self.parse)

As you can see, I've used unique=True, but it still prints duplicate URLs in the terminal, whereas I want only the unique URLs, not duplicates.

Any help on this matter would be greatly appreciated.

Since the code looks at the content of the URLs recursively, you will see duplicate URLs coming from the parsing of other pages. You essentially create a new instance of LxmlLinkExtractor() for every response, and unique=True only removes duplicates among the links extracted from that single response, not across the whole crawl.
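One way to print each URL only once (a minimal sketch, not part of the original answer; the class name UniqueUrlsSpider and the seen attribute are hypothetical) is to keep a set of URLs the spider has already handled and skip any link already in it. Scrapy's scheduler does filter duplicate requests by default, but the print call runs before that filter, so the de-duplication has to happen inside parse itself:

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UniqueUrlsSpider(scrapy.Spider):
    name = 'uniqueurlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Reuse a single extractor instead of building a new one per response.
    link_extractor = LxmlLinkExtractor(allow_domains=allowed_domains, unique=True)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()  # URLs already printed/scheduled across all pages

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            if link.url in self.seen:
                continue  # already handled when extracted from an earlier page
            self.seen.add(link.url)
            print(link.url)
            yield scrapy.Request(link.url, callback=self.parse)

With this, each URL is printed and scheduled at most once, regardless of how many pages link to it.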
