
How to make a Scrapy POST request with a token in the payload?

I'm trying to scrape all 22 jobs on this webpage and then a bunch more from other companies that are using the same system to host their jobs.

I can get the first 10 jobs on the page, but the rest have to be loaded 10 at a time by clicking on a 'Show more' button. The URL doesn't change when you do that, and the only change I can see is that a token is added to the payload of the POST request.

Image of Request Payload in Network tool

I've tried following the answers to this Stack Exchange question and this one, but I still can't get it to work.

Here's my current code:

  def start_requests(self):
    url = 'https://apply.workable.com/api/v3/accounts/so-energy/jobs'
    headers = {'authority': 'https://apply.workable.com'}
    payload = {
      "token":"WzE2NjI2ODE2MDAwMDAsMjY0NTU4N10=",
      "query":"",
      "location":[],
      "department":[],
      "worktype":[],
      "remote":[]}
    yield scrapy.Request(url = url,
                          method='POST',
                          headers = headers,
                          body = json.dumps(payload),
                          callback = self.parse)
    
  def parse(self, response):
    data = json.loads(response.body)
    print(data)

This gives me the first 10 jobs, but no more. I get exactly the same result if I remove the payload bits.

Any ideas?

(I'm very new to coding and this is my first question here, so apologies if I've missed something obvious but I've been trying to get this for hours. Thank you!)

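The token is not a fixed value. Each JSON response includes a nextPage value, and you need to send that value as the token in the payload of the next request. Send the first request without a token, then keep following nextPage until the response no longer includes one. Note that the code below also sets a Content-Type: application/json header, which your request does not send.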

from json import dumps
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    API_url = 'https://apply.workable.com/api/v3/accounts/so-energy/jobs'
    custom_settings = {'DOWNLOAD_DELAY': 0.6}
    payload = {
        "department": [],
        "location": [],
        "query": "",
        "remote": [],
        "worktype": []
    }
    headers = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "Host": "apply.workable.com",
        "Origin": "https://apply.workable.com",
        "Pragma": "no-cache",
        "Referer": "https://apply.workable.com/so-energy/",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Sec-GPC": "1",
        "TE": "trailers",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(url=self.API_url, headers=self.headers, body=dumps(self.payload), method="POST")

    def parse(self, response):
        # jobs
        data = response.json()
        for job in data['results']:
            yield {'job_details': job}

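        # next page: stop when the response no longer carries a 'nextPage' token
        next_token = data.get('nextPage')
        if next_token:
            # build a fresh payload instead of mutating the shared class attribute
            payload = {**self.payload, 'token': next_token}
            yield scrapy.Request(url=self.API_url, headers=self.headers, body=dumps(payload), method="POST")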
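
If it helps to see the pagination logic on its own, here is a minimal sketch without Scrapy. It assumes the response shape used above (a results list plus an optional nextPage token). The names build_body, iter_jobs and fake_fetch are hypothetical helpers made up for illustration, and fake_fetch returns invented data in place of the real API.

```python
import json
from typing import Callable, Iterator, Optional

# Filter fields taken from the payload in the question; all left empty.
BASE_PAYLOAD = {"query": "", "location": [], "department": [], "worktype": [], "remote": []}


def build_body(token: Optional[str] = None) -> str:
    """Return the JSON body for one page; the first page carries no token."""
    payload = dict(BASE_PAYLOAD)  # copy, so the shared base is never modified
    if token is not None:
        payload["token"] = token
    return json.dumps(payload)


def iter_jobs(fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield jobs page by page, following nextPage until it is absent."""
    token = None
    while True:
        data = fetch(build_body(token))
        yield from data.get("results", [])
        token = data.get("nextPage")
        if not token:
            break


def fake_fetch(body: str) -> dict:
    """Hypothetical stand-in for the POST request (made-up data, no network)."""
    pages = {
        None: {"results": [{"id": 1}, {"id": 2}], "nextPage": "tok-2"},
        "tok-2": {"results": [{"id": 3}], "nextPage": "tok-3"},
        "tok-3": {"results": [{"id": 4}]},  # last page: no nextPage key
    }
    return pages[json.loads(body).get("token")]


jobs = list(iter_jobs(fake_fetch))
print([job["id"] for job in jobs])
```

Because each body is built from a copy of the base payload, a token from one company's job list can't leak into requests for another company, which matters if you scrape several accounts in the same run.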


