
Scrapy not following next page

I have been sitting on this problem for a long time, but nothing I try works. My goal is simply to extract data from a job board. Each page lists 20 job offers. I extract the data of each offer with a scrapy callback, which more or less works. The problem is that scrapy does not jump to the next page, no matter what I try. I first tried scrapy & selenium, which didn't work. Now I am trying scrapy alone, following a tutorial, but it still only extracts the first 20 offers from page 1.

Important note: the Next button changes with every page, which means its xpath/css selector changes too. I have tried css last-nth-child and xpath last()-1, but without a satisfying result. To make things harder, the link sits in an a tag nested inside the element whose xpath varies.
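The idea of addressing a shifting Next button by position, then descending into its a tag for the link, can be sketched with the stdlib's limited XPath support (the pagination markup below is an assumption for illustration; the real page may differ, and a selector ending in `/../@href` would ascend to the ul, which carries no href at all):

```python
import xml.etree.ElementTree as ET

# Illustrative pagination markup -- an assumption, not the real page.
snippet = """
<ul class="pagination">
  <li class="page-item"><a href="/recht-jobs?seite=1">1</a></li>
  <li class="page-item"><a href="/recht-jobs?seite=2">2</a></li>
  <li class="page-item"><a href="/recht-jobs?seite=3">Next</a></li>
</ul>
"""
root = ET.fromstring(snippet)

# Select the last page-item by position (robust to the button moving),
# then go DOWN into its <a> tag to read the href.
next_link = root.find("li[last()]/a")
print(next_link.get("href"))  # /recht-jobs?seite=3
```

In a Scrapy selector the same descent would be written as `.../a/@href` rather than `.../../@href`.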

Here is the code:

import scrapy
from random import randint
from time import sleep


class WorkpoolJobsSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['workpool-jobs.ch']
    start_urls = ['https://www.workpool-jobs.ch/recht-jobs']

    def parse(self, response):
        SET_SELECTOR = "//p[@class='inserattitel h2 mt-0']/a/@href"
        for joboffer in response.xpath(SET_SELECTOR):
            url1 = response.urljoin(joboffer.get())
            yield scrapy.Request(url1, callback=self.parse_dir_contents)

        next_page = response.xpath(".//li[@class='page-item'][last()-1]/../@href").get()
        sleep(randint(5, 10))
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_dir_contents(self, response):
        single_info = response.xpath(".//*[@class='col-12 col-md mr-md-3 mr-xl-5']")

        for info in single_info:
            info_Titel = info.xpath(".//article/h1[@class='inserattitel']/text()").extract_first()
            info_Berufsfelder = info.xpath(".//article/div[@class='border-top-grau']/p/text()").extract()
            info_Arbeitspensum = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[1]/text()").extract_first()
            info_Anstellungsverhältnis = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[2]/text()").extract_first()
            info_Arbeitsort = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[4]/a/text()").extract()
            info_VerfügbarAb = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[5]/text()").extract()
            info_Kompetenzenqualifikation = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-7']/dl[2]/dd/text()").extract_first()
            info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]").extract()
            info_Erwartungen = info.xpath(".//article/div[@class='border-bottom-grau'][2]/ul/li[descendant-or-self::text()]").extract()
            info_WirBietenIhnen = info.xpath(".//article/div[@class='border-bottom-grau'][3]/ul/li[descendant-or-self::text()]").extract()
            info_Publikationsdatum = info.xpath(".//article/footer[@class='inseratfooter']/p[1]/strong/text()").extract_first()

            yield {'Titel': info_Titel,
                   'Berufsfelder': info_Berufsfelder,
                   'Arbeitspensum': info_Arbeitspensum,
                   'Anstellungsverhältnis': info_Anstellungsverhältnis,
                   'Arbeitsort': info_Arbeitsort,
                   'VerfügbarAb': info_VerfügbarAb,
                   'Kompetenzenqualifikation': info_Kompetenzenqualifikation,
                   'Aufgabengebiet': info_Aufgabengebiet,
                   'Erwartungen': info_Erwartungen,
                   'WirBietenIhnen': info_WirBietenIhnen,
                   'Publikationsdatum': info_Publikationsdatum}

Any help is greatly appreciated!

With some hints from furas I finally managed to get my code working. If anyone runs into the same problem in the future, maybe my code below can help you too:

import scrapy
from random import randint
from time import sleep


class WorkpoolJobsSpider(scrapy.Spider):
    name = "getdata"
    page_number = 2
    allowed_domains = ["workpool-jobs.ch"]
    start_urls = ["https://www.workpool-jobs.ch/recht-jobs"]

    def parse(self, response):
        SET_SELECTOR = "//p[@class='inserattitel h2 mt-0']/a/@href"
        for joboffer in response.xpath(SET_SELECTOR):
            url1 = response.urljoin(joboffer.get())
            yield scrapy.Request(url1, callback=self.parse_dir_contents)

        next_page = "https://www.workpool-jobs.ch/recht-jobs?seite=" + str(WorkpoolJobsSpider.page_number)
        sleep(randint(5, 10))
        if WorkpoolJobsSpider.page_number < 27:
            WorkpoolJobsSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

    def parse_dir_contents(self, response):
        single_info = response.xpath(".//*[@class='col-12 col-md mr-md-3 mr-xl-5']")

        for info in single_info:
            info_Titel = info.xpath(".//article/h1[@class='inserattitel']/text()").extract_first()
            info_Berufsfelder = info.xpath(".//article/div[@class='border-top-grau']/p/text()").extract()
            info_Arbeitspensum = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[1]/text()").extract_first()
            info_Anstellungsverhältnis = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[2]/text()").extract_first()
            info_Arbeitsort = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[4]/a/text()").extract()
            info_VerfügbarAb = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[5]/text()").extract()
            info_Kompetenzenqualifikation = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-7']/dl[2]/dd/text()").extract_first()
            info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]").extract()
            info_Erwartungen = info.xpath(".//article/div[@class='border-bottom-grau'][2]/ul/li[descendant-or-self::text()]").extract()
            info_WirBietenIhnen = info.xpath(".//article/div[@class='border-bottom-grau'][3]/ul/li[descendant-or-self::text()]").extract()
            info_Publikationsdatum = info.xpath(".//article/footer[@class='inseratfooter']/p[1]/strong/text()").extract_first()

            yield {'Titel': info_Titel,
                   'Berufsfelder': info_Berufsfelder,
                   'Arbeitspensum': info_Arbeitspensum,
                   'Anstellungsverhältnis': info_Anstellungsverhältnis,
                   'Arbeitsort': info_Arbeitsort,
                   'VerfügbarAb': info_VerfügbarAb,
                   'Kompetenzenqualifikation': info_Kompetenzenqualifikation,
                   'Aufgabengebiet': info_Aufgabengebiet,
                   'Erwartungen': info_Erwartungen,
                   'WirBietenIhnen': info_WirBietenIhnen,
                   'Publikationsdatum': info_Publikationsdatum}
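As a small stdlib aside (my addition, not part of the original answer): the fix above builds the page URL by string concatenation, which works here because the base URL has no query string yet. A helper based on urllib.parse sets the seite parameter the same way while also surviving URLs that already carry other parameters (the parameter name seite is taken from the URL in the answer):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def page_url(base, page):
    """Return `base` with its 'seite' query parameter set to `page`."""
    parts = urlparse(base)
    query = parse_qs(parts.query)      # keep any parameters already present
    query["seite"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(page_url("https://www.workpool-jobs.ch/recht-jobs", 2))
# https://www.workpool-jobs.ch/recht-jobs?seite=2
```

This gives the same URLs as the concatenation in the spider, so it can be dropped in without changing the pagination logic.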

