https://www.kralilan.com/liste/kiralik-bina
This is the website I am trying to scrape. When you open the website, the listings are generated with an ajax request. The same request keeps populating page whenever you scroll down. This is how they implemented infinite scrolling...
I found out this is the request sent to the server when I scroll down and I tried to simulate the same request with headers and request payload. This is my spider.
class MySpider(scrapy.Spider):
name = 'kralilanspider'
allowed_domains = ['kralilan.com']
start_urls = [
'https://www.kralilan.com/liste/satilik-bina'
]
def parse(self, response):
headers = {'Referer': 'https://www.kralilan.com/liste/kiralik-bina',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
#'Content-Type': 'application/json; charset=utf-8',
#'X-Requested-With': 'XMLHttpRequest',
#'Content-Length': 246,
#'Connection': 'keep-alive',
}
yield scrapy.Request(
url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
method='POST',
headers=headers,
callback=self.parse_ajax
)
def parse_ajax(self, response):
yield {'data': response.text}
response.body
, I get TypeError: Object of type bytes is not JSON serializable
. What am I missing here?
The following implementation will fetch you the response you would like to grab. You missed the most important part data
to pass as a parameter in your post requests.
import json
import scrapy
class MySpider(scrapy.Spider):
name = 'kralilanspider'
data = {'incomestr':'["Bina","1",-1,-1,-1,-1,-1,5]', 'intextstr':'{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}', 'index':0 , 'count':'10' , 'opt':'1' , 'type':'3'}
def start_requests(self):
yield scrapy.Request(
url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
method='POST',
body=json.dumps(self.data),
headers={"content-type": "application/json"}
)
def parse(self, response):
items = json.loads(response.text)['d']
yield {"data":items}
In case you wanna parse data from multiple pages (new page index is recorded when you scroll downward), the following will do the trick. The pagination is within index
key in your data.
import json
import scrapy
class MySpider(scrapy.Spider):
name = 'kralilanspider'
data = {'incomestr':'["Bina","1",-1,-1,-1,-1,-1,5]', 'intextstr':'{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}', 'index':0 , 'count':'10' , 'opt':'1' , 'type':'3'}
headers = {"content-type": "application/json"}
url = 'https://www.kralilan.com/services/ki_operation.asmx/getFilter'
def start_requests(self):
yield scrapy.Request(
url=self.url,
method='POST',
body=json.dumps(self.data),
headers=self.headers,
meta={'index': 0}
)
def parse(self, response):
items = json.loads(response.text)['d']
res = scrapy.Selector(text=items)
for item in res.css(".list-r-b-div"):
title = item.css(".add-title strong::text").get()
price = item.css(".item-price::text").get()
yield {"title":title,"price":price}
page = response.meta['index'] + 1
self.data['index'] = page
yield scrapy.Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'index': page})
Why do you ignore POST body
? You need to submit it too:
def parse(self, response):
headers = {'Referer': 'https://www.kralilan.com/liste/kiralik-bina',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Content-Type': 'application/json; charset=utf-8',
'X-Requested-With': 'XMLHttpRequest',
#'Content-Length': 246,
#'Connection': 'keep-alive',
}
payload = """
{ incomestr:'["Bina","2",-1,-1,-1,-1,-1,5]', intextstr:'{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}', index:'0' , count:'10' , opt:'1' , type:'3'}
"""
yield scrapy.Request(
url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
method='POST',
body=payload,
headers=headers,
callback=self.parse_ajax
)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.