Scrapy not following next page
I have been stuck on this problem for a long time, and nothing I try works. My goal is simply to extract data from a job board. Each page lists 20 job offers, and I extract the data of each offer with a Scrapy callback. That more or less works. The problem is that Scrapy does not move on to the next page, no matter what I try. I first tried Scrapy together with Selenium, which did not work. Now I am trying Scrapy alone, following a tutorial, but it still only extracts data from the first 20 offers on page 1.
Important note: the next-page button changes with every page, which means its XPath/CSS selector changes too. I have tried CSS nth-last-child and XPath last()-1, but without a satisfying result. To make it harder, the link sits in an a tag behind that variable XPath element.
Here is the code:
import scrapy
from random import randint
from time import sleep


class WorkpoolJobsSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['workpool-jobs.ch']
    start_urls = ['https://www.workpool-jobs.ch/recht-jobs']

    def parse(self, response):
        SET_SELECTOR = "//p[@class='inserattitel h2 mt-0']/a/@href"
        for joboffer in response.xpath(SET_SELECTOR):
            url1 = response.urljoin(joboffer.get())
            yield scrapy.Request(url1, callback=self.parse_dir_contents)
        next_page = response.xpath(".//li[@class='page-item'][last()-1]/../@href").get()
        wait(randint(5, 10))
        if next_page:
            yield response.follow(url=next_page, callback=self.parse)

    def parse_dir_contents(self, response):
        single_info = response.xpath(".//*[@class='col-12 col-md mr-md-3 mr-xl-5']")
        for info in single_info:
            info_Titel = info.xpath(".//article/h1[@class='inserattitel']/text()").extract_first()
            info_Berufsfelder = info.xpath(".//article/div[@class='border-top-grau']/p/text()").extract()
            info_Arbeitspensum = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[1]/text()").extract_first()
            info_Anstellungsverhältnis = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[2]/text()").extract_first()
            info_Arbeitsort = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[4]/a/text()").extract()
            info_VerfügbarAb = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[5]/text()").extract()
            info_Kompetenzenqualifikation = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-7']/dl[2]/dd/text()").extract_first()
            info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]").extract()
            info_Erwartungen = info.xpath(".//article/div[@class='border-bottom-grau'][2]/ul/li[descendant-or-self::text()]").extract()
            info_WirBietenIhnen = info.xpath(".//article/div[@class='border-bottom-grau'][3]/ul/li[descendant-or-self::text()]").extract()
            info_Publikationsdatum = info.xpath(".//article/footer[@class='inseratfooter']/p[1]/strong/text()").extract_first()
            yield {'Titel': info_Titel,
                   'Berufsfelder': info_Berufsfelder,
                   'Arbeitspensum': info_Arbeitspensum,
                   'Anstellungsverhältnis': info_Anstellungsverhältnis,
                   'Arbeitsort': info_Arbeitsort,
                   'VerfügbarAb': info_VerfügbarAb,
                   'Kompetenzenqualifikation': info_Kompetenzenqualifikation,
                   'Aufgabengebiet': info_Aufgabengebiet,
                   'Erwartungen': info_Erwartungen,
                   'WirBietenIhnen': info_WirBietenIhnen,
                   'Publikationsdatum': info_Publikationsdatum}
Any help is greatly appreciated!
With some hints from furas I finally managed to get my code working. If anyone runs into the same problem in the future, maybe the code below can help you too:
import scrapy
from random import randint
from time import sleep


class WorkpoolJobsSpider(scrapy.Spider):
    name = "getdata"
    page_number = 2
    allowed_domains = ["workpool-jobs.ch"]
    start_urls = ["https://www.workpool-jobs.ch/recht-jobs"]

    def parse(self, response):
        SET_SELECTOR = "//p[@class='inserattitel h2 mt-0']/a/@href"
        for joboffer in response.xpath(SET_SELECTOR):
            url1 = response.urljoin(joboffer.get())
            yield scrapy.Request(url1, callback=self.parse_dir_contents)
        next_page = "https://www.workpool-jobs.ch/recht-jobs?seite=" + str(WorkpoolJobsSpider.page_number)
        sleep(randint(5, 10))
        if WorkpoolJobsSpider.page_number < 27:
            WorkpoolJobsSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

    def parse_dir_contents(self, response):
        single_info = response.xpath(".//*[@class='col-12 col-md mr-md-3 mr-xl-5']")
        for info in single_info:
            info_Titel = info.xpath(".//article/h1[@class='inserattitel']/text()").extract_first()
            info_Berufsfelder = info.xpath(".//article/div[@class='border-top-grau']/p/text()").extract()
            info_Arbeitspensum = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[1]/text()").extract_first()
            info_Anstellungsverhältnis = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[2]/text()").extract_first()
            info_Arbeitsort = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[4]/a/text()").extract()
            info_VerfügbarAb = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-5']/dl/dd[5]/text()").extract()
            info_Kompetenzenqualifikation = info.xpath(".//article/div[@class='row bg-hellstblau']/div[@class='col-12 col-sm-6 col-lg-7']/dl[2]/dd/text()").extract_first()
            info_Aufgabengebiet = info.xpath(".//article/div[@class='border-bottom-grau'][1]//*[self::p or self::li]").extract()
            info_Erwartungen = info.xpath(".//article/div[@class='border-bottom-grau'][2]/ul/li[descendant-or-self::text()]").extract()
            info_WirBietenIhnen = info.xpath(".//article/div[@class='border-bottom-grau'][3]/ul/li[descendant-or-self::text()]").extract()
            info_Publikationsdatum = info.xpath(".//article/footer[@class='inseratfooter']/p[1]/strong/text()").extract_first()
            yield {'Titel': info_Titel,
                   'Berufsfelder': info_Berufsfelder,
                   'Arbeitspensum': info_Arbeitspensum,
                   'Anstellungsverhältnis': info_Anstellungsverhältnis,
                   'Arbeitsort': info_Arbeitsort,
                   'VerfügbarAb': info_VerfügbarAb,
                   'Kompetenzenqualifikation': info_Kompetenzenqualifikation,
                   'Aufgabengebiet': info_Aufgabengebiet,
                   'Erwartungen': info_Erwartungen,
                   'WirBietenIhnen': info_WirBietenIhnen,
                   'Publikationsdatum': info_Publikationsdatum}