
Scrapy Not Crawling Any Pages

I'm crawling the site https://oa.mo.gov/personnel/classification-specifications/all . I need to get to each position page and then extract some information. I figure I could do this with a LinkExtractor or by finding all the URLs with XPath, which is what I'm attempting below. The spider doesn't show any errors, but also doesn't crawl any pages:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from StateOfMoJDs.items import StateOfMoJDs

class StateOfMoJDs(scrapy.Spider):
    name = 'StateOfMoJDs'
    allowed_domains = ['oa.mo.gov']
    start_urls = ['https://oa.mo.gov/personnel/classification-specifications/all']

    def parse(self, response):
        for url in response.xpath('//span[@class="field-content"]/a/@href').extract():
            url2 = 'https://oa.mo.gov' + url
            scrapy.Request(url2, callback=self.parse_job)


    def parse_job(self, response):
        item = StateOfMoJDs()
        item["url"] = response.url
        item["jobtitle"] = response.xpath('//span[@class="page-title"]/text()').extract()
        item["salaryrange"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[2]/div[1]/div[2]/div/text()').extract()
        item["classnumber"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[1]/div[1]/div/div[2]/div//text()').extract()
        item["paygrade"] = response.xpath('//*[@id="class-spec-compact"]/div/div[1]/div[3]/div/div[2]/div//text()').extract()
        item["definition"] = response.xpath('//*[@id="class-spec-compact"]/div/div[2]/div[1]/div[2]/div/p//text()').extract()
        item["jobduties"] = response.xpath('//*[@id="class-spec-compact"]/div/div[2]/div[2]/div[2]/div/div//text()').extract()
        item["basicqual"] = response.xpath('//*[@id="class-spec-compact"]/div/div[3]/div[1]/div/div//text()').extract()
        item["specialqual"] = response.xpath('//*[@id="class-spec-compact"]/div/div[3]/div[2]/div[2]/div//text()').extract()
        item["keyskills"] = response.xpath('//*[@id="class-spec-compact"]/div/div[4]/div/div[2]/div/div//text()').extract()
        yield item

When using scrapy shell, `response.xpath('//span[@class="field-content"]/a/@href').extract()` yields a list of relative URLs:

['/personnel/classification-specifications/3005', '/personnel/classification-specifications/3006', '/personnel/classification-specifications/3007', ...]

In your `parse()` method, you need to `yield` your request:

yield scrapy.Request(url2, callback=self.parse_job)
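To see why this matters without running a full crawl, here is a minimal standalone sketch (no Scrapy required): `scrapy.Request(...)` only constructs a request object, and Scrapy schedules it only if the callback yields it back to the engine. The sketch below stands in for the `parse()` loop as a plain generator, and also uses `urljoin` instead of string concatenation, which is more robust for relative hrefs:

```python
from urllib.parse import urljoin

def parse(hrefs, base='https://oa.mo.gov'):
    """Stand-in for the spider's parse(): without `yield`, the loop
    builds values and silently discards them, which is exactly why the
    original spider crawled nothing."""
    for href in hrefs:
        # urljoin handles relative paths safely, like response.urljoin()
        yield urljoin(base, href)

urls = list(parse(['/personnel/classification-specifications/3005',
                   '/personnel/classification-specifications/3006']))
print(urls[0])
# https://oa.mo.gov/personnel/classification-specifications/3005
```

In the real spider, `response.urljoin(url)` builds the absolute URL for you, and `response.follow(url, callback=self.parse_job)` combines the join and the request in one call — but either way, the request must be yielded.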
