How to scrape elements from scraped URLs? Scrapy

Question

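OK, let's say I have a website that lists job offers across multiple pages (the pages are dynamic, which is why I'm using Selenium). What I want to do:

Scrape the URL of every job posting across those pages,
Scrape a few items (title, location, etc.) from each of those URLs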


import scrapy
from scrapy import Request
from selenium import webdriver


class JobScraper(scrapy.Spider):
    name = "jobscraper"
    allowed_domains = ['pracuj.pl']
    total = 10
    start_urls = [
        'https://www.pracuj.pl/praca/it%20-%20rozw%c3%b3j%20oprogramowania;cc,5016/%c5%82%c3%b3dzkie;r,5?rd=10&pn={}'.format(i)
        for i in range(1, total)
    ]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def __init__(self):
        self.options = webdriver.ChromeOptions()
        self.options.headless = True
        self.driver = webdriver.Chrome(r'C:\Users\kacpe\OneDrive\Pulpit\Python\Projekty\chromedriver.exe',
                                       options=self.options)

    def parse(self, response):
        self.driver.get(response.url)
        res = response.replace(body=self.driver.page_source)

        offers = res.xpath('//li[contains(@class, "results__list-container")]')
        for offer in offers:
            link = offer.xpath('.//a[@class="offer-details__title-link"]/@href').extract()
            yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        title = response.xpath('//h1[@data-scroll-id="job-title"]/text()').extract()
        yield {
            'job_title': title
        }

But it doesn't work, and I get this error:

TypeError: Request url must be str or unicode, got list

Answer 1

You are calling extract on this line:

link = offer.xpath('.//a[@class="offer-details__title-link"]/@href').extract()

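extract() returns a list of matches rather than a single string, which is why you get the error when you pass link to Request.

Depending on what you want to do, you can either loop with for link in links and issue a Request for each result, or use find_elements_by_xpath with your XPath.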

Answer 2

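You don't need Selenium to scrape the content you want. It turns out that the items you want from that site are inside a script tag. Once you dig that part out with a regular expression and process it with the json library, you should be able to access them very easily. Here is what I mean: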

import json
import scrapy

class JobScraper(scrapy.Spider):
    name = "jobscraper"
    total = 10
    start_urls = [
        'https://www.pracuj.pl/praca/it%20-%20rozw%c3%b3j%20oprogramowania;cc,5016/%c5%82%c3%b3dzkie;r,5?rd=10&pn={}'.format(i)
        for i in range(1, total)
    ]

    # Send a browser-like User-Agent with every request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
    }

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        # The listing data is embedded as JSON in the script tag that assigns window.__INITIAL_STATE__.
        items = response.css("script:contains('window.__INITIAL_STATE__')::text").re_first(r"window\.__INITIAL_STATE__ =(.*);")
        for item in json.loads(items)['offers']:
            yield {
                "title": item['jobTitle'],
                "employer": item['employer'],
                "country": item['countryName'],
                "details_page": item['companyProfileUrl']
            }
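The regex-plus-json step can also be tried offline, without Scrapy or a network request. The following is a minimal sketch using only the standard library; the HTML fragment, the employer value, and the helper name extract_offers are assumptions made for illustration, and the real payload on the site is much larger.

```python
import json
import re

# Hypothetical page fragment; only the variable name window.__INITIAL_STATE__
# and the 'offers' / 'jobTitle' / 'employer' keys are taken from the answer.
SAMPLE_HTML = """
<html><body>
<script>window.__INITIAL_STATE__ = {"offers": [{"jobTitle": "Python Developer", "employer": "Acme"}]};</script>
</body></html>
"""

# Capture the object literal assigned to window.__INITIAL_STATE__.
# The match is greedy, so it assumes no later "};" follows the payload.
STATE_RE = re.compile(r"window\.__INITIAL_STATE__\s*=\s*(\{.*\});", re.DOTALL)


def extract_offers(html):
    """Return the list stored under 'offers', or [] if the state is missing."""
    match = STATE_RE.search(html)
    if match is None:
        return []
    state = json.loads(match.group(1))
    return state.get("offers", [])


offers = extract_offers(SAMPLE_HTML)
titles = [offer["jobTitle"] for offer in offers]
```

Returning an empty list when the script tag is absent keeps a spider from crashing on pages whose layout differs, at the cost of silently skipping them, so logging such cases may be worthwhile.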

2 answers

Answer 1
2 2021-04-11 16:49:53

Answer 2
2 2021-04-11 17:14:00
