
Scraping a domain for links recursively using Scrapy

Here is the code I'm using for scraping all the URLs of a domain:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # rules must be an iterable (note the trailing comma) and are only applied by CrawlSpider subclasses
    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True),)

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print(link.url)

            yield scrapy.Request(link.url, callback=self.parse)

As you can see, I've used unique=True, but it still prints duplicate URLs in the terminal, whereas I want only the unique URLs, not duplicates.

Any help on this matter would be greatly appreciated.

Since the code looks at the content of the URLs recursively, you will see duplicate URLs coming from the parsing of other pages. You essentially create a new instance of LxmlLinkExtractor() for every response, and unique=True only removes duplicates among the links extracted from that single response, not across the whole crawl.
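One way to print each URL only once (a minimal sketch, not part of the original answer; the class name UniqueUrlsSpider and the seen attribute are hypothetical) is to keep a set of URLs the spider has already handled and skip any link already in it. Scrapy's scheduler does filter duplicate requests by default, but the print call runs before that filter, so the de-duplication has to happen inside parse itself:

import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UniqueUrlsSpider(scrapy.Spider):
    name = 'uniqueurlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Reuse a single extractor instead of building a new one per response.
    link_extractor = LxmlLinkExtractor(allow_domains=allowed_domains, unique=True)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()  # URLs already printed/scheduled across all pages

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            if link.url in self.seen:
                continue  # already handled when extracted from an earlier page
            self.seen.add(link.url)
            print(link.url)
            yield scrapy.Request(link.url, callback=self.parse)

With this, each URL is printed and scheduled at most once, regardless of how many pages link to it.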
