
Scrapy crawler extracts URLs but misses half of the callbacks

I'm facing a strange issue while trying to scrape this URL:

To perform the crawling, I designed this:

import logging

# Older-style Scrapy import paths, matching the SgmlLinkExtractor used below;
# newer Scrapy versions expose these under scrapy.spiders / scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class IkeaSpider(CrawlSpider):

    name = "Ikea"
    allowed_domains = ["www.ikea.com"]
    start_urls = ["http://www.ikea.com/fr/fr/catalog/productsaz/8/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=[r'.*/catalog/products/\d+']),
             callback='parse_page',
             follow=True),
    )

    logging.basicConfig(filename='example.log', level=logging.ERROR)

    def parse_page(self, response):
        for sel in response.xpath('//div[@class="rightContent"]'):
            pass  # Blah blah blah (item extraction omitted)

I launch the spider from the command line and I can see the URLs being extracted normally, but for some of them the callback is never executed (only about half of them are actually scraped).

As there are more than 150 links on this page, that may explain why the crawler is missing callbacks (too many jobs). Does anyone have any idea about this?

This is the log:

    2015-12-25 09:02:55 [scrapy] INFO: Stored csv feed (107 items) in: test.csv
    2015-12-25 09:02:55 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 68554,
     'downloader/request_count': 217,
     'downloader/request_method_count/GET': 217,
     'downloader/response_bytes': 4577452,
     'downloader/response_count': 217,
     'downloader/response_status_count/200': 216,
     'downloader/response_status_count/404': 1,
     'dupefilter/filtered': 107,
     'file_count': 106,
     'file_status_count/downloaded': 106,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 12, 25, 8, 2, 55, 548350),
     'item_scraped_count': 107,
     'log_count/DEBUG': 433,
     'log_count/ERROR': 2,
     'log_count/INFO': 8,
     'log_count/WARNING': 1,
     'request_depth_max': 2,
     'response_received_count': 217,
     'scheduler/dequeued': 110,
     'scheduler/dequeued/memory': 110,
     'scheduler/enqueued': 110,
     'scheduler/enqueued/memory': 110,
     'start_time': datetime.datetime(2015, 12, 25, 8, 2, 28, 656959)}
    2015-12-25 09:02:55 [scrapy] INFO: Spider closed (finished)

EDIT: I've read a lot about my problem and, apparently, the CrawlSpider class is not specific enough; that might explain why it misses some links, for reasons I can't explain. Basically, it is advised to use the BaseSpider class with start_requests and the make_requests_from_url method to do the job in a more controlled way. I'm still not completely sure how to do that precisely; this is just a hint.
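For reference, a minimal sketch of that start_requests-based approach could look like the following. The spider name, the //a/@href extraction and the '/catalog/products/' substring filter are illustrative assumptions derived from the rule above, not tested code; BaseSpider is simply the old name of scrapy.Spider in recent versions.

import scrapy


class IkeaProductsSpider(scrapy.Spider):
    # Hypothetical spider mirroring the question's start URL.
    name = "ikea_products"
    start_urls = ["http://www.ikea.com/fr/fr/catalog/productsaz/8/"]

    def start_requests(self):
        # Build each initial request explicitly instead of relying on CrawlSpider rules.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Extract candidate product links ourselves and schedule one request per link.
        for href in response.xpath('//a/@href').extract():
            if '/catalog/products/' in href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        # Item extraction would go here, as in the original parse_page.
        pass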

I'm not a fan of these auto spider classes. I usually just build exactly what I need.

import logging

import scrapy


class IkeaSpider(scrapy.Spider):

    name = "Ikea"
    allowed_domains = ["www.ikea.com"]
    start_urls = ["https://www.ikea.com/fr/fr/cat/produits-products/"]

    logging.basicConfig(filename='example.log', level=logging.ERROR)

    def parse(self, response):
        # You could also use an a.vn-nav__link::attr(href) selector.
        # Collect every link whose href points at a /fr/cat/ category page.
        for link in response.css('a[href*="/fr/cat/"]::attr(href)').getall():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_category)

    def parse_category(self, response):
        # parse items or potential sub categories
        pass
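As a purely illustrative follow-up, parse_category could recurse into sub-categories with the same href pattern and yield something for each product link. The '/fr/p/' product pattern below is a placeholder assumption, not taken from the actual page:

    def parse_category(self, response):
        # Recurse into any sub-category links using the same href pattern.
        for link in response.css('a[href*="/fr/cat/"]::attr(href)').getall():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_category)

        # Yield a minimal item for each product link found (pattern is assumed).
        for product in response.css('a[href*="/fr/p/"]::attr(href)').getall():
            yield {'product_url': response.urljoin(product)}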
