用scrapy抓取多个页面

Question

I am trying to use scrapy to scrape a website that has several pages of information. 我正在尝试使用scrapy来抓取一个包含多页信息的网站。

my code is: 我的代码是：

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

I am trying to scrape all the pages until it reaches the end of the pages ... sometimes there will be more pages than others so its hard to say exactly where the page numbers end. 我试图刮掉所有页面，直到它到达页面的末尾...有时会有比其他页面更多的页面，因此很难准确说出页码的结束位置。

Answer 1

The idea is to increment pageNumber until there is no titles found. 想法是增加pageNumber直到找不到titles 。 If no titles on the page - throw CloseSpider exception to stop the spider: 如果页面上没有titles - 抛出CloseSpider异常来停止蜘蛛：

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from tcgplayer1.items import Tcgplayer1Item


URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"

class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        print self.page_number
        print "----------"

        sel = Selector(response)
        titles = sel.xpath("//div[@class='magicCard']")
        if not titles:
            raise CloseSpider('No more pages')

        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)

This particular spider would go throw all 8 pages of the data, then stop. 这个特殊的蜘蛛会抛出所有8页数据，然后停止。

Hope that helps. 希望有所帮助。

用scrapy抓取多个页面

问题描述

1 个解决方案

解决方案1
10 已采纳 2014-05-27 20:09:25

用scrapy抓取多个页面

问题描述

1 个解决方案

解决方案1 10 已采纳 2014-05-27 20:09:25

解决方案1
10 已采纳 2014-05-27 20:09:25