Scrapy not scraping all items on a page

I'm scraping an e-commerce site which has 48 products on each page except the last one.

I'm using Scrapy for this. The problem is that it does not scrape all the products on each page. For example, it scrapes 12 from page 1, 18 from page 2, 10 from page 3, 19 from page 4, and so on. It should scrape all 48 products from each page, but it doesn't.

Below is my script. For the last two days I haven't been able to figure out what I'm doing wrong.

UPDATE: I deduped the URL list before scraping and added log messages to find out what the issue is. Current code:

import scrapy
from productspider.items import Product
from urlparse import urlparse  # Python 2; on Python 3 this would be: from urllib.parse import urlparse


class Ecommerce(scrapy.Spider):
    name = "ecommerce"

    def __init__(self, *args, **kwargs):
        urls = kwargs.pop('urls', [])
        if urls:
            self.start_urls = urls.split(',')
        self.logger.info(self.start_urls)
        super(Ecommerce, self).__init__(*args, **kwargs)

    page = 1
    parse_product_called = 0

    def parse(self, response):

        url = response.url
        if url.endswith('/'):
            url = url.rstrip('/')

        o = urlparse(url)

        products = response.xpath(
            "//a[contains(@href, '" + o.path + "/products/')]/@href").extract()

        if not products:
            raise scrapy.exceptions.CloseSpider("All products scraped")

        products = dedupe(products)

        self.logger.info("Products found on page %s = %s" % (self.page, len(products)))
        for product in products:
            yield scrapy.Request(response.urljoin(product), self.parse_product)

        self.page += 1
        next_page = o.path + "?page=" + str(self.page)
        yield scrapy.Request(response.urljoin(next_page), self.parse)

    def parse_product(self, response):

        self.parse_product_called += 1
        self.logger.info("Parse product called %s time" % self.parse_product_called)

        product = Product()
        product["name"] = response.xpath(
            "//meta[@property='og:title']/@content")[0].extract()
        product["price"] = response.xpath(
            "//meta[@property='og:price:amount']/@content")[0].extract()

        return product

def dedupe(seq, idfun=None):
    # Remove duplicates from seq while preserving order.
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen:
            continue
        seen[marker] = 1
        result.append(item)
    return result
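For context, the urls spider argument handled in __init__ is the comma-separated list passed on the command line when the spider is started, for example (the URLs here are only placeholders):

scrapy crawl ecommerce -a urls="https://example-store.com,https://another-store.com"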

Scrapy log after crawling:

2017-12-30 13:18:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 86621,
 'downloader/request_count': 203,
 'downloader/request_method_count/GET': 203,
 'downloader/response_bytes': 10925361,
 'downloader/response_count': 203,
 'downloader/response_status_count/200': 203,
 'finish_reason': 'All products scraped',
 'finish_time': datetime.datetime(2017, 12, 30, 7, 48, 55, 370000),
 'item_scraped_count': 193,
 'log_count/DEBUG': 397,
 'log_count/INFO': 210,
 'request_depth_max': 9,
 'response_received_count': 203,
 'scheduler/dequeued': 203,
 'scheduler/dequeued/memory': 203,
 'scheduler/enqueued': 418,
 'scheduler/enqueued/memory': 418,
 'start_time': datetime.datetime(2017, 12, 30, 7, 48, 22, 405000)}
2017-12-30 13:18:55 [scrapy.core.engine] INFO: Spider closed (All products scraped)

And the log messages:

2017-12-30 13:18:25 [ecommerce] INFO: Products found on page 1 = 48
2017-12-30 13:18:32 [ecommerce] INFO: Products found on page 2 = 48
2017-12-30 13:18:35 [ecommerce] INFO: Products found on page 3 = 48
2017-12-30 13:18:38 [ecommerce] INFO: Products found on page 4 = 48
2017-12-30 13:18:41 [ecommerce] INFO: Products found on page 5 = 48
2017-12-30 13:18:43 [ecommerce] INFO: Products found on page 6 = 48
2017-12-30 13:18:45 [ecommerce] INFO: Products found on page 7 = 48
2017-12-30 13:18:48 [ecommerce] INFO: Products found on page 8 = 48
2017-12-30 13:18:51 [ecommerce] INFO: Products found on page 9 = 24

The log "Parse product called" was printed each time parse_product was called.每次调用 parse_product 时都会打印日志“Parse product called”。 The last log message is:最后一条日志消息是:

2017-12-30 13:18:55 [ecommerce] INFO: Parse product called 193 time

As you can see, it found a total of 408 products but called the parse_product function only 193 times, so only 193 items were scraped.

Two issues in your code

Shutting the scraper down

if not products:
   raise scrapy.exceptions.CloseSpider("All products scraped")

With the code above you ask the spider to terminate as soon as it can. When CloseSpider is raised, Scrapy shuts the spider down and the requests still waiting in the scheduler are never processed; this is exactly what your stats show (418 requests enqueued, only 203 dequeued, only 193 products parsed). It should only be used when you really don't want the crawl to continue.
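If you still want to stop paginating when a page comes back with no products, a minimal alternative (just a sketch; it becomes unnecessary once the pagination fix below is in place) is to return from the callback instead of raising, so product requests that are already scheduled still get processed:

        if not products:
            # No products on this page: stop paginating, but let the
            # product-page requests that are already scheduled finish.
            return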

Not ending the scraper

self.page += 1
next_page = o.path + "?page=" + str(self.page)
yield scrapy.Request(response.urljoin(next_page), self.parse)

Your paging logic is unbounded and never ends on its own, so it needs a stopping condition. You can use the fact that any page with fewer than 48 products is the last page:

self.page += 1
next_page = o.path + "?page=" + str(self.page)
if len(products) == 48:
    yield scrapy.Request(response.urljoin(next_page), self.parse)
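
Putting both fixes together, the parse callback could look like the sketch below (it keeps the question's XPath, page counter and dedupe helper, and simply drops the CloseSpider exception):

    def parse(self, response):
        url = response.url
        if url.endswith('/'):
            url = url.rstrip('/')
        o = urlparse(url)

        products = dedupe(response.xpath(
            "//a[contains(@href, '" + o.path + "/products/')]/@href").extract())

        self.logger.info("Products found on page %s = %s" % (self.page, len(products)))

        # Schedule every product page found on this listing page.
        for product in products:
            yield scrapy.Request(response.urljoin(product), self.parse_product)

        # Only follow the next page while the current one is full;
        # a page with fewer than 48 products is the last one.
        self.page += 1
        if len(products) == 48:
            next_page = o.path + "?page=" + str(self.page)
            yield scrapy.Request(response.urljoin(next_page), self.parse)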
