
How to make a POST request in Scrapy that requires a request payload

I am trying to parse data from this website.
In the Network tab of the browser's developer tools I found that the page sends a POST request to https://busfor.pl/api/v1/searches, which returns the JSON I am interested in.
This POST request carries a request payload containing a dictionary.
I assumed it was ordinary form data of the kind passed to FormRequest in Scrapy, but the request returns a 403 error.

I have already tried the following.

url = "https://busfor.pl/api/v1/searches"
formdata = {
    "from_id": d_id,
    "to_id": a_id,
    "on": "2019-10-10",
    "passengers": 1,
    "details": [],
}
yield scrapy.FormRequest(url, callback=self.parse, formdata=formdata)

This returns a 403 error.
I also tried the following, based on a Stack Overflow post.

url = "https://busfor.pl/api/v1/searches"
payload = [{
    "from_id": d_id,
    "to_id": a_id,
    "on": "2019-10-10",
    "passengers": 1,
    "details": [],
}]
yield scrapy.Request(url, self.parse, method="POST", body=json.dumps(payload))

But this returns the same error.
Can someone help me figure out how to get the required data using Scrapy?

The second approach is the right way to send a POST request with JSON data, but you are passing the wrong JSON to the site: it expects a dictionary, not a list of dictionaries. So instead of:

payload = [{
    "from_id": d_id,
    "to_id": a_id,
    "on": "2019-10-10",
    "passengers": 1,
    "details": [],
}]

You should use:

payload = {
    "from_id": d_id,
    "to_id": a_id,
    "on": "2019-10-10",
    "passengers": 1,
    "details": [],
}
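
To make the difference concrete, here is a small standard-library sketch that compares the request bodies each attempt would produce. It is only an illustration: the IDs are sample values taken from the search URL, and `urlencode` merely approximates how form data is serialized (Scrapy's own encoding may differ in details).

```python
import json
from urllib.parse import urlencode

# Sample values only; in the spider these come from d_id and a_id.
payload = {
    "from_id": "62113",
    "to_id": "3559",
    "on": "2019-10-10",
    "passengers": 1,
    "details": [],
}

# Form-style body (roughly what a form submission sends): key=value pairs.
# Note that the empty "details" list disappears entirely.
form_body = urlencode(payload, doseq=True)

# JSON object body: what the answer says the endpoint expects.
json_body = json.dumps(payload)

# JSON array body: what the second attempt sent (a list wrapping the dict).
list_body = json.dumps([payload])

print(form_body)
print(json_body)
print(list_body)
```

The three bodies are structurally different, so a server that parses the body as a JSON object can reject the first and third even though the field values are the same.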

You also missed the headers sent with the POST request. Sites sometimes use IDs and hashes to control access to their API; in this case I found two values that appear to be needed, X-CSRF-Token and X-NewRelic-ID. Luckily, both values are available on the search page.
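
As a rough sketch of that extraction step outside of Scrapy, the snippet below pulls both values out of a page with plain regular expressions and builds the header dictionary. The HTML fragment, the token values, and the helper name `build_headers` are made up for illustration; the real page is much larger and its markup may differ.

```python
import re

# Hypothetical, trimmed-down page source (not the real busfor.pl markup).
sample_html = """
<html><head>
<meta name="csrf-token" content="EXAMPLE_CSRF_TOKEN">
<script>var cfg = {xpid:"EXAMPLE_XPID"};</script>
</head></html>
"""


def build_headers(html):
    """Return the headers for the JSON POST, or raise if a value is missing."""
    csrf = re.search(r'<meta name="csrf-token" content="([^"]*)"', html)
    xpid = re.search(r'xpid:"(.*?)"', html)
    if csrf is None or xpid is None:
        raise ValueError("could not find csrf-token or xpid in the page")
    return {
        "X-CSRF-Token": csrf.group(1),
        "X-NewRelic-ID": xpid.group(1),
        "Content-Type": "application/json; charset=UTF-8",
    }


headers = build_headers(sample_html)
print(headers)
```

Failing loudly when a value is missing makes it easier to notice if the site changes its markup, instead of sending a request with empty headers and getting another 403.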

Here is a working spider; the search result is available in the self.parse_search method.

import json
import scrapy

class BusForSpider(scrapy.Spider):
    name = 'busfor'
    start_urls = ['https://busfor.pl/autobusy/Sopot/Gda%C5%84sk?from_id=62113&on=2019-10-09&passengers=1&search=true&to_id=3559']
    search_url = 'https://busfor.pl/api/v1/searches'

    def parse(self, response):
        # JSON payload for the search API: a single dictionary, not a list.
        payload = {"from_id" : '62113',
                   "to_id" : '3559',
                   "on" : '2019-10-10',
                   "passengers" : 1,
                   "details" : []}
        # Both header values are embedded in the search page itself.
        csrf_token = response.xpath('//meta[@name="csrf-token"]/@content').get()
        newrelic_id = response.xpath('//script/text()').re_first(r'xpid:"(.*?)"')
        headers = {
            'X-CSRF-Token': csrf_token,
            'X-NewRelic-ID': newrelic_id,
            'Content-Type': 'application/json; charset=UTF-8',
        }
        # Send the payload as a JSON-encoded body rather than as form data.
        yield scrapy.Request(
            self.search_url,
            callback=self.parse_search,
            method="POST",
            body=json.dumps(payload),
            headers=headers,
        )

    def parse_search(self, response):
        # The API responds with JSON; decode it and extract what you need.
        data = json.loads(response.text)

