
Scrapy Not Crawling Any Pages

I'm crawling the site https://oa.mo.gov/personnel/classification-specifications/all . I need to get to each position page and then extract some information. I figure I could do this with a LinkExtractor or by finding all the URLs with XPath, which is what I'm attempting below. The spider doesn't show any errors, but also doesn't crawl any pages:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from StateOfMoJDs.items import StateOfMoJDs

class StateOfMoJDs(scrapy.Spider):
    name = 'StateOfMoJDs'
    allowed_domains = ['oa.mo.gov']
    start_urls = ['https://oa.mo.gov/personnel/classification-specifications/all']

    def parse(self, response):
        for url in response.xpath('//span[@class="field-content"]/a/@href').extract():
            url2 = 'https://oa.mo.gov' + url
            scrapy.Request(url2, callback=self.parse_job)


    def parse_job(self, response):
        item = StateOfMoJDs()
        item["url"] = response.url
        item["jobtitle"] = response.xpath('//span[@class="page-title"]/text()').extract()
        item["salaryrange"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[2]/div[1]/div[2]/div/text()').extract()
        item["classnumber"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[1]/div[1]/div/div[2]/div//text()').extract()
        item["paygrade"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[3]/div/div[2]/div//text()').extract()
        item["definition"] = response.xpath('//*[@id="class-spec-compact"]/div/div[2]/div[1]/div[2]/div/p//text()').extract()
        item["jobduties"] = response.xpath('//*[@id="class-spec-compact"]/div/div[2]/div[2]/div[2]/div/div//text()').extract()
        item["basicqual"] = response.xpath('//*[@id="class-spec-compact"]/div/div[3]/div[1]/div/div//text()').extract()
        item["specialqual"] = response.xpath('//*[@id="class-spec-compact"]/div/div[3]/div[2]/div[2]/div//text()').extract()
        item["keyskills"] = response.xpath('//*[@id="class-spec-compact"]/div/div[4]/div/div[2]/div/div//text()').extract()
        yield item

When using scrapy shell, `response.xpath('//span[@class="field-content"]/a/@href').extract()` yields a list of relative URLs:

['/personnel/classification-specifications/3005', '/personnel/classification-specifications/3006', '/personnel/classification-specifications/3007', ...]

In your `parse()` method, you need to `yield` your request:

yield scrapy.Request(url2, callback=self.parse_job)
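To see why this matters without running a full crawl, here is a minimal standalone sketch (no Scrapy required): `scrapy.Request(...)` only constructs a request object, and Scrapy schedules it only if the callback yields it back to the engine. The sketch below stands in for the `parse()` loop as a plain generator, and also uses `urljoin` instead of string concatenation, which is more robust for relative hrefs:

```python
from urllib.parse import urljoin

def parse(hrefs, base='https://oa.mo.gov'):
    """Stand-in for the spider's parse(): without `yield`, the loop
    builds values and silently discards them, which is exactly why the
    original spider crawled nothing."""
    for href in hrefs:
        # urljoin handles relative paths safely, like response.urljoin()
        yield urljoin(base, href)

urls = list(parse(['/personnel/classification-specifications/3005',
                   '/personnel/classification-specifications/3006']))
print(urls[0])
# https://oa.mo.gov/personnel/classification-specifications/3005
```

In the real spider, `response.urljoin(url)` builds the absolute URL for you, and `response.follow(url, callback=self.parse_job)` combines the join and the request in one call — but either way, the request must be yielded.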
