
Scrapy CSS selector

I am learning how to use Scrapy, but I am having some issues. Following an online tutorial, I wrote this code to understand it a bit better.

import scrapy


class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets/year-2016']

    def parse(self, response):
        SET_SELECTOR = '.set'
        for brickset in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'h1 a ::text'
            PIECES_SELECTOR = './/dl[dt/text() = "Pieces"]/dd/a/text()'
            MINIFIGS_SELECTOR = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
            PRICE_SELECTOR = './/dl[dt/text() = "RRP"]/dd[3]/text()'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
                'pieces': brickset.xpath(PIECES_SELECTOR).extract_first(),
                'minifigs': brickset.xpath(MINIFIGS_SELECTOR).extract_first(),
                'retail price': brickset.xpath(PRICE_SELECTOR).extract_first(),
                'image': brickset.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.next a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )
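As a side note, `response.urljoin` resolves the (possibly relative) href against the URL of the current page. The same behavior is available in the standard library, which makes it easy to check what URL a request will actually be sent to (the hrefs below are illustrative examples, not taken from the real site):

```python
from urllib.parse import urljoin

base = 'http://brickset.com/sets/year-2016'

# A root-relative href is resolved against the site root:
print(urljoin(base, '/sets/year-2016/page-2'))
# http://brickset.com/sets/year-2016/page-2

# An absolute href passes through unchanged:
print(urljoin(base, 'http://brickset.com/sets/year-2015'))
# http://brickset.com/sets/year-2015
```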

Since the site divides the listed products by year and this code only crawls data from 2016, I decided to extend it to also analyze the data of previous years. The idea of the code is this:

PREVIOUS_YEAR_SELECTOR = '...'
previous_year = response.css(PREVIOUS_YEAR_SELECTOR).extract_first()
if previous_year:
    yield scrapy.Request(
        response.urljoin(previous_year),
        callback=self.parse
    )

I have tried different things, but I really have no idea what to write instead of '...'. I also tried XPath, but nothing seems to work.

Maybe you want to exploit the structure of the href attribute? It seems to follow the pattern /sets/year-YYYY. With that, you can use a regex-based selector or, if you are lazy like me, just a contains():

XPath: //a[contains(@href,"/sets/year-")]/@href
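The contains() above is a plain substring match; if you want the stricter "regex based selector" idea, the pattern itself can be sketched in plain Python with the stdlib re module (the hrefs below are made-up examples):

```python
import re

# Match hrefs of the form /sets/year-YYYY and capture the year.
YEAR_LINK = re.compile(r'/sets/year-(\d{4})$')

hrefs = ['/sets/year-2016', '/sets/year-2015', '/sets/theme-Technic']
years = [m.group(1) for m in (YEAR_LINK.search(h) for h in hrefs) if m]
# years == ['2016', '2015']
```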

I'm not sure if this is also possible with CSS (the attribute substring selector a[href*="/sets/year-"] should work, too). So the ... can be filled with:

PREVIOUS_YEAR_SELECTOR_XPATH = '//a[contains(@href,"/sets/year-")]/@href'
previous_year = response.xpath(PREVIOUS_YEAR_SELECTOR_XPATH).extract_first()

But I think you will want ALL years, so maybe you want to loop over the links:

PREVIOUS_YEAR_SELECTOR_XPATH = '//a[contains(@href,"/sets/year-")]/@href'
for previous_year in response.xpath(PREVIOUS_YEAR_SELECTOR_XPATH).extract():
    yield scrapy.Request(response.urljoin(previous_year), callback=self.parse)
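One thing to be aware of: every year page presumably links to all the other years, so the same URLs get yielded over and over. Scrapy's default request dupefilter drops repeated requests for you, but the idea can be sketched with a plain seen-set (the link list here is a hypothetical example):

```python
# Hypothetical hrefs as they might appear repeatedly across year pages.
links = ['/sets/year-2016', '/sets/year-2015', '/sets/year-2016']

seen = set()       # mimics Scrapy's built-in request dupefilter
to_follow = []
for href in links:
    if href not in seen:
        seen.add(href)
        to_follow.append(href)

# to_follow == ['/sets/year-2016', '/sets/year-2015']
```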

I think you are on a good way. Google for a CSS/XPath cheat sheet that matches your needs, and check out the FirePath extension or similar. It speeds up selector setup a lot :)

You have at least two options here. The first is to use the generic CrawlSpider and define which links you want to extract and follow. Something like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BrickSetSpider(CrawlSpider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/sets']
    rules = (
        Rule(LinkExtractor(allow=r'\/year\-[\d]{4}'),
             callback='parse_bricks', follow=True),
    )

    # Your method renamed to parse_bricks goes here

Note: you need to rename the parse method to some other name like 'parse_bricks', since the CrawlSpider uses the parse method itself.

The second option is to set start_urls to the page http://brickset.com/browse/sets containing all links to year sets, and add a method to parse those links:

import scrapy


class BrickSetSpider(scrapy.Spider):
    name = 'brick_spider'
    start_urls = ['http://brickset.com/browse/sets']

    def parse(self, response):
        links = response.xpath(
            '//a[contains(@href, "/sets/year")]/@href').extract()
        for link in links:
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_bricks)

    # Your method renamed to parse_bricks goes here
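What that parse method does can be sketched without Scrapy at all, using only the stdlib html.parser; the HTML snippet below is a made-up stand-in for the real page, not Brickset's actual markup:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class YearLinkParser(HTMLParser):
    """Collect hrefs containing '/sets/year', like the XPath above."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if '/sets/year' in href:
                self.links.append(href)


html = '<a href="/sets/year-2016">2016</a> <a href="/about">About</a>'
parser = YearLinkParser()
parser.feed(html)
urls = [urljoin('http://brickset.com/browse/sets', h) for h in parser.links]
# urls == ['http://brickset.com/sets/year-2016']
```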
