
Scrapy - Simulating AJAX requests with headers and request payload

https://www.kralilan.com/liste/kiralik-bina

This is the website I am trying to scrape. When you open it, the listings are generated with an AJAX request, and the same request keeps populating the page whenever you scroll down. This is how they implemented infinite scrolling...

Request

I found out this is the request sent to the server when I scroll down, and I tried to simulate the same request with its headers and request payload. This is my spider:

class MySpider(scrapy.Spider):

    name = 'kralilanspider'
    allowed_domains = ['kralilan.com']
    start_urls = [
        'https://www.kralilan.com/liste/satilik-bina'
    ]

    def parse(self, response):

        headers = {'Referer': 'https://www.kralilan.com/liste/kiralik-bina',
                   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
                   'Accept': 'application/json, text/javascript, */*; q=0.01',
                   'Accept-Language': 'en-US,en;q=0.5',
                   'Accept-Encoding': 'gzip, deflate, br',
                   #'Content-Type': 'application/json; charset=utf-8',
                   #'X-Requested-With': 'XMLHttpRequest',
                   #'Content-Length': 246,
                   #'Connection': 'keep-alive',
                   }

        yield scrapy.Request(
            url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
            method='POST',
            headers=headers,
            callback=self.parse_ajax
        )

    def parse_ajax(self, response):
        yield {'data': response.text}
  • If I uncomment the commented-out headers, the request fails with status code 400 or 500.
  • I tried to send the request payload as the body in the parse method. That didn't work either.
  • If I try to yield response.body, I get TypeError: Object of type bytes is not JSON serializable.
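(For the TypeError on its own: response.body is bytes, while response.text is the same payload decoded to str, so yielding the latter keeps the item JSON serializable. A minimal illustration, using a made-up stand-in value instead of a real Scrapy response:)

```python
import json

# response.body is bytes; response.text is the same payload decoded to str.
# The value below is a stand-in for illustration, not a real response.
body = b'{"d": "listings html"}'   # what response.body holds
text = body.decode('utf-8')        # what response.text would give you

json.dumps({'data': text})         # serializes fine
# json.dumps({'data': body})       # would raise the TypeError above
```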

What am I missing here?

The following implementation will fetch the response you are after. You missed the most important part: the data to pass as the body of your POST request.

import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'kralilanspider'
    data = {
        'incomestr': '["Bina","1",-1,-1,-1,-1,-1,5]',
        'intextstr': '{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}',
        'index': 0,
        'count': '10',
        'opt': '1',
        'type': '3',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
            method='POST',
            body=json.dumps(self.data),
            headers={"content-type": "application/json"}
        )

    def parse(self, response):
        items = json.loads(response.text)['d']
        yield {"data":items}
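For context on the ['d'] indexing: ASP.NET .asmx services wrap their JSON result in a top-level "d" key, so the rendered HTML fragment lives under it. A toy illustration (the sample value below is invented, not a real response from the site):

```python
import json

# .asmx JSON responses have the shape {"d": <result>}; here the result
# is an HTML fragment. The sample value is made up for illustration.
sample = json.dumps({'d': '<div class="list-r-b-div">...</div>'})
items = json.loads(sample)['d']
print(items)  # <div class="list-r-b-div">...</div>
```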

If you want to parse data from multiple pages (a new page index is requested each time you scroll down), the following will do the trick. The pagination is handled by the index key in the payload.

import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'kralilanspider'
    data = {
        'incomestr': '["Bina","1",-1,-1,-1,-1,-1,5]',
        'intextstr': '{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}',
        'index': 0,
        'count': '10',
        'opt': '1',
        'type': '3',
    }
    headers = {"content-type": "application/json"}
    url = 'https://www.kralilan.com/services/ki_operation.asmx/getFilter'

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            body=json.dumps(self.data),
            headers=self.headers,
            meta={'index': 0}
        )

    def parse(self, response):
        items = json.loads(response.text)['d']
        res = scrapy.Selector(text=items)
        for item in res.css(".list-r-b-div"):
            title = item.css(".add-title strong::text").get()
            price = item.css(".item-price::text").get()
            yield {"title": title, "price": price}

        page = response.meta['index'] + 1
        self.data['index'] = page
        yield scrapy.Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'index': page})
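One caveat with the pagination above: as written, the spider keeps yielding requests even after the last page. A common guard is to stop once a page comes back empty. The logic is sketched here without Scrapy so it stands alone; fetch_page is a hypothetical stand-in for the POST round trip:

```python
import json

def paginate(fetch_page, data):
    """Request pages by incrementing `index` until an empty page returns."""
    pages, index = [], 0
    while True:
        data['index'] = index
        html = fetch_page(json.dumps(data))  # the 'd' HTML fragment
        if not html.strip():                 # no listings -> last page reached
            break
        pages.append(html)
        index += 1
    return pages

# Toy fetcher standing in for the site: three pages of results, then empty.
fake = ['<div>p0</div>', '<div>p1</div>', '<div>p2</div>', '']
print(len(paginate(lambda body: fake[json.loads(body)['index']], {'index': 0})))  # 3
```

In the Scrapy version, the equivalent check is to return from parse without yielding the next request when res.css(".list-r-b-div") selects nothing.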

Why do you ignore the POST body? You need to submit it too:

    def parse(self, response):

        headers = {'Referer': 'https://www.kralilan.com/liste/kiralik-bina',
                   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
                   'Accept': 'application/json, text/javascript, */*; q=0.01',
                   'Accept-Language': 'en-US,en;q=0.5',
                   'Accept-Encoding': 'gzip, deflate, br',
                   'Content-Type': 'application/json; charset=utf-8',
                   'X-Requested-With': 'XMLHttpRequest',
                   #'Content-Length': 246,
                   #'Connection': 'keep-alive',
                   }

        # Serialize the payload with json.dumps so the body is valid JSON
        # (requires `import json` at the top of the file).
        payload = json.dumps({
            'incomestr': '["Bina","2",-1,-1,-1,-1,-1,5]',
            'intextstr': '{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}',
            'index': '0',
            'count': '10',
            'opt': '1',
            'type': '3',
        })
        yield scrapy.Request(
            url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
            method='POST',
            body=payload,
            headers=headers,
            callback=self.parse_ajax
        )
