Scrapy - Request Payload format and types
This is the starting point of my scraping process.
https://www.storiaimoveis.com.br/alugar/brasil
This is the AJAX call, which returns the data in JSON format for every page.
https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e
My POST request fails with a 404 error. Requests that require payloads have given me trouble in the past. I always solved the problem somehow, but now I'm trying to understand what I am doing wrong with them.
My question is:
Do I need to call json.dumps(payload) before sending payloads, or can I send them as dictionaries?
These are the relevant parts of my code.
import json

import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = [
        'https://www.storiaimoveis.com.br/api/search?fields=%24%24meta.geo.postalCodeAddress.city%2C%24%24meta.geo.postalCodeAddress.neighborhood%2C%24%24meta.geo.postalCodeAddress.street%2C%24%24meta.location%2C%24%24meta.created%2Caddress.number%2Caddress.postalCode%2Caddress.neighborhood%2Caddress.state%2Cmedia%2ClivingArea%2CtotalArea%2Ctypes%2Coperation%2CsalePrice%2CrentPrice%2CnewDevelopment%2CadministrationFee%2CyearlyTax%2Caccount.logoUrl%2Caccount.name%2Caccount.id%2Caccount.creci%2Cgarage%2Cbedrooms%2Csuites%2Cbathrooms%2Cref&optimizeMedia=true&size=20&from=0&sessionId=5ff29d7e-88d0-54d5-2641-e203cafd6f4e'
    ]
    page = 1
    # Search filters sent as the JSON body of the POST request.
    payload = {"locations": [{"geo": {"top_left": {"lat": 5.2717863,
                                                   "lon": -73.982817},
                                      "bottom_right": {"lat": -34.0891,
                                                       "lon": -28.650543}},
                              "placeId": "ChIJzyjM68dZnAARYz4p8gYVWik",
                              "keywords": "Brasil",
                              "address": {"label": "Brasil", "country": "BR"}}],
               "operation": ["RENT"],
               "bathrooms": [],
               "bedrooms": [],
               "garage": [],
               "features": []}
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/json',
        'Referer': 'https://www.storiaimoveis.com.br/alugar/brasil',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }

    def parse(self, response):
        for url in self.start_urls:
            yield scrapy.Request(url=url,
                                 method='POST',
                                 headers=self.headers,
                                 body=json.dumps(self.payload),
                                 callback=self.parse_items)

    def parse_items(self, response):
        # Open an interactive shell to inspect the response.
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        print(response.text)
Yes, you need to call json.dumps(payload) because the request body needs to be str or unicode, as stated in the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html#request-objects
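To make the difference concrete, here is a minimal sketch using only the standard library. The payload is an abbreviated, illustrative version of the one in the question, and the variable names are mine. It shows that json.dumps returns a string that parses back to the same data, while simply converting the dictionary with str() gives a Python repr that is not valid JSON.

```python
import json

# Abbreviated, illustrative payload (the one in the question has more keys).
payload = {
    "locations": [{"keywords": "Brasil",
                   "address": {"label": "Brasil", "country": "BR"}}],
    "operation": ["RENT"],
    "bedrooms": [],
}

# json.dumps returns a str containing valid JSON (double-quoted keys and values).
# This is the kind of value that would be passed as body= to scrapy.Request.
body = json.dumps(payload)

# str() gives the Python repr, which uses single quotes and is NOT valid JSON.
not_json = str(payload)

print(type(body).__name__)           # str
print(json.loads(body) == payload)   # True

try:
    json.loads(not_json)
except ValueError:
    print("str(payload) cannot be parsed as JSON")
```

So the dictionary itself is never what goes over the wire; the serialized string is.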
But in your case, the request fails because of these two missing headers: Content-Type and Referer.
What I usually do to find the right request headers is this: use curl or Postman to make the request until I get the right headers. In this case, Content-Type and Referer seem to be enough for an HTTP 200 response status.
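If you would rather stay in Python than switch to curl or Postman, the same trial-and-error can be sketched with urllib from the standard library. This is only an illustration under some assumptions: build_probe and probe are hypothetical helper names, the URL uses a shortened query string, and the payload is abbreviated. The code below only builds the requests; nothing is sent until probe() is called.

```python
import json
import urllib.error
import urllib.request

# Assumption: shortened query string; use the full search URL from the question.
URL = "https://www.storiaimoveis.com.br/api/search?size=20&from=0"

# Abbreviated, illustrative payload.
PAYLOAD = {"operation": ["RENT"], "bedrooms": []}

# Header sets to try, from smallest to largest.
CANDIDATE_HEADERS = [
    {"Content-Type": "application/json"},
    {"Content-Type": "application/json",
     "Referer": "https://www.storiaimoveis.com.br/alugar/brasil"},
]


def build_probe(url, payload, headers):
    """Build (but do not send) a JSON POST request with the given headers."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def probe(request):
    """Send the request and return the HTTP status code (needs network access)."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 404 arrive as exceptions; report the code.
        return err.code


requests = [build_probe(URL, PAYLOAD, headers) for headers in CANDIDATE_HEADERS]
for request in requests:
    print(request.get_method(), sorted(name for name, _ in request.header_items()))
```

Calling probe(request) on each entry in turn shows which header set first gets a 200. Whatever works there can then be copied into the headers dictionary of the spider.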