Scrapy XPath selector

Question

I am scraping this site with Scrapy, but I am having trouble with XPath and I'm not entirely sure what is going on.

Why does this work:

def parse_item(self, response):
    item = BotItem()

    for title in response.xpath('//h1'):
        item['title'] = title.xpath('strong/text()').extract()
        item['wage'] = title.xpath('span[@class="price"]/text()').extract()
        yield item

but the following code does not?

def parse_item(self, response):
    item = BotItem()

    for title in response.xpath('//body'):
        item['title'] = title.xpath('h1/strong/text()').extract()
        item['wage'] = title.xpath('h1/span[@class="price"]/text()').extract()
        yield item

I also want to extract the text matched by:

//div[@id="description"]/p

But I can't, because that div is outside the h1 node. How can I achieve this? My full code is:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bot.items import BotItem


class MufmufSpider(CrawlSpider):
    name = 'mufmuf'
    allowed_domains = ['mufmuf.ro']
    start_urls = ['http://mufmuf.ro/locuri-de-munca/joburi-in-strainatate/']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="paginate"][position() = last()]'), 
            #callback='parse_start_url', 
            follow=True
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//h3/a'), 
            callback='parse_item', 
            follow=True
        ),
    )

    def parse_item(self, response):
        item = BotItem()

        for title in response.xpath('//h1'):
            item['title'] = title.xpath('strong/text()').extract()
            item['wage'] = title.xpath('span[@class="price"]/text()').extract()
            #item['description'] = title.xpath('div[@id="descirption"]/p/text()').extract()
            yield item

Answer 1

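The for title in response.xpath('//body'): version does not work because the relative XPath expressions inside the loop, such as h1/strong/text(), only match an h1 element that is a direct child of the body element. If the h1 is nested deeper in the page, nothing matches; a descendant search such as .//h1/strong/text() would be needed to reach it.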
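
The difference between a child step (h1/...) and a descendant step (.//h1/...) can be seen in isolation in the minimal sketch below. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so that it runs without a project, and the markup is made up: I am assuming the h1 on the real page sits inside some wrapper element rather than directly under body.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (not the real page): the <h1> is wrapped in a <div>,
# so it is a descendant of <body> but not a direct child of it.
PAGE = """
<html>
  <body>
    <div id="content">
      <h1><strong>Job title</strong><span class="price">1000 EUR</span></h1>
      <div id="description"><p>Job description</p></div>
    </div>
  </body>
</html>
""".strip()

body = ET.fromstring(PAGE).find("body")

# Child step: matches only <h1> elements that are direct children of <body>.
direct = [e.text for e in body.findall("h1/strong")]

# Descendant step: matches <h1> elements at any depth below <body>.
nested = [e.text for e in body.findall(".//h1/strong")]
wage = [e.text for e in body.findall(".//h1/span[@class='price']")]

# The description is not inside <h1>, but it is still reachable from <body>.
description = [e.text for e in body.findall(".//div[@id='description']/p")]

print(direct)
print(nested, wage, description)
```

With this markup the child-step query comes back empty while the descendant-step queries find the title, the wage and the description. The same reasoning should apply to Scrapy's response.xpath(), although its selectors support full XPath 1.0 and ElementTree only a subset.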

Moreover, since each page contains only one item to extract, you don't need a loop at all:

def parse_item(self, response):
    item = BotItem()

    item["title"] = response.xpath('//h1/strong/text()').extract()
    item["wage"] = response.xpath('//h1/span[@class="price"]/text()').extract()
    item["description"] = response.xpath('//div[@id="description"]/p/text()').extract()

    return item

(This should also answer your second question, about the description.)
