Scrapy - How does a request sent to an API with the requests library differ from one sent with scrapy.Request?

I am a beginner with Scrapy and was trying to scrape this website, https://directory.ntschools.net/#/schools, which uses JavaScript to load its contents. So I checked the browser's Network tab and found an API address: https://directory.ntschools.net/api/System/GetAllSchools. If you open this address directly, the data is in XML format. But when you check the Response tab while inspecting the Network tab, the data is there in JSON format.

I first tried Scrapy and sent the request to the API address WITHOUT any headers. The response it returned was in XML, which threw a JSONDecodeError when passed to json.loads(). So I added the header 'Accept': 'application/json', and the response I got was in JSON. That worked well:

import json

import scrapy


class NtseSpider_new(scrapy.Spider):
    name = 'ntse_new'
    # Explicit headers; 'Accept' asks the server for JSON instead of XML.
    header = {
        'Accept': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56',
    }

    def start_requests(self):
        yield scrapy.Request(
            'https://directory.ntschools.net/api/System/GetAllSchools',
            callback=self.parse,
            headers=self.header,
        )

    def parse(self, response):
        data = json.loads(response.body)  # returned JSON response

But then I used the requests module WITHOUT any headers, and the response I got was in JSON too!

import json

import requests

res = requests.get('https://directory.ntschools.net/api/System/GetAllSchools')
js = json.loads(res.content)  # returned JSON response

Can anyone please tell me whether there is any difference between the two types of request? Is there a default response format for the requests module when making a request to an API? Surely I am missing something. Thanks.

It's because Scrapy sets the Accept header to 'text/html,application/xhtml+xml,application/xml...' by default. You can see that in Scrapy's DEFAULT_REQUEST_HEADERS setting.

I experimented and found that the server sends a JSON response if the request has no Accept header.
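
To make the difference concrete, here is a minimal sketch of how a server might choose a response format from the Accept header. The negotiate function, the SUPPORTED tuple, and the JSON default are assumptions chosen to match the behaviour observed above; they are not the actual implementation of the directory.ntschools.net API. The sketch also relies on the fact that requests sends "Accept: */*" when no headers are passed, which would explain why it received JSON.

```python
from typing import Optional

# Hypothetical: media types this imagined server can produce.
SUPPORTED = ("application/json", "application/xml")
# Hypothetical: the format it falls back to when the client has no preference.
DEFAULT = "application/json"


def negotiate(accept: Optional[str]) -> str:
    """Pick a response type from an Accept header (q-values ignored for brevity)."""
    if not accept:
        return DEFAULT  # no Accept header: fall back to the default
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()
        if media_type == "*/*":
            return DEFAULT  # client accepts anything
        if media_type in SUPPORTED:
            return media_type  # first listed type the server can produce
    return DEFAULT


# Prefix of Scrapy's default Accept header, as quoted in the answer above.
scrapy_accept = "text/html,application/xhtml+xml,application/xml"
# requests sends this Accept value when no headers are supplied.
requests_accept = "*/*"

print(negotiate(scrapy_accept))       # application/xml
print(negotiate(requests_accept))     # application/json
print(negotiate("application/json"))  # application/json
print(negotiate(None))                # application/json
```

Under this model, Scrapy's default header lists application/xml before any wildcard, so the server answers with XML, while requests and an explicit 'Accept': 'application/json' both end up with JSON. To check the real defaults on your machine, you can print requests.utils.default_headers() and Scrapy's DEFAULT_REQUEST_HEADERS setting.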
