
Scrapy returns a 401 Unauthorized response

The site is https://www.extratodebito.detran.pr.gov.br/detranextratos/geraExtrato.do?action=iniciarProcesso and I request it from my spider like this:

        yield Request(self.url, callback=self.login_me, dont_filter=True)

The response is <html><head><title>Error</title></head><body>Unauthorized</body></html>

But if I make the same request with the requests library, it works fine.

Why does this happen?

UPDATE:

The headers of a normal browser request look like this:

Host: www.extratodebito.detran.pr.gov.br
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9

I added them to the Scrapy request, but the request that is actually sent also contains an Authorization header that I did not set myself:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Authorization: Basic MTM2ZGNjNmFhOWZmNDA1Njk1YWU1MWE0ZjI1MzZlYzE6
Host: www.extratodebito.detran.pr.gov.br
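
The Authorization value above uses the HTTP Basic scheme, which is just base64 of `user:password`, so decoding it can show which credentials are being attached. A minimal sketch using only the standard library (the helper name `decode_basic_auth` is made up for illustration):

```python
import base64


def decode_basic_auth(header_value):
    """Split an HTTP Basic Authorization header value into (user, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Everything before the first ':' is the user, the rest is the password
    user, _, password = decoded.partition(":")
    return user, password


user, password = decode_basic_auth(
    "Basic MTM2ZGNjNmFhOWZmNDA1Njk1YWU1MWE0ZjI1MzZlYzE6"
)
# Print only the lengths so the credential itself is not echoed to the log
print(len(user), len(password))
```

If the decoded user matches a value configured elsewhere in the project (for example an http_user spider attribute), the header is probably being added by a middleware rather than by the headers passed to Request.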

UPDATE 2:

Solved by removing http_user and http_pass from the spider. I had set them for Splash, but scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware was also sending them with ordinary (non-Splash) requests.
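
If Splash still needs credentials after removing the spider attributes, one possible way to supply them is through project settings instead. This is only a sketch under assumptions: it assumes a scrapy-splash release that supports the SPLASH_USER and SPLASH_PASS settings (0.8.0 or later, as far as I know), and every value below is a placeholder.

```python
# settings.py (sketch; placeholder values, not real credentials)

# Hypothetical Splash endpoint for illustration
SPLASH_URL = "http://localhost:8050"

# Assumption: recent scrapy-splash versions read these two settings and
# apply them to Splash requests only, so http_user / http_pass can be
# dropped from the spider.
SPLASH_USER = "my-splash-user"
SPLASH_PASS = "my-splash-password"
```

Alternatively, Scrapy 2.5.1 and later accept an http_auth_domain spider attribute that limits http_user / http_pass to a single domain.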

It works fine for me after adding the Accept, Accept-Language and Accept-Encoding headers. I tested it in scrapy shell:

from scrapy import Request
# Browser-like Accept* headers; plain str keys and values are enough
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Accept-Encoding': 'gzip,deflate,br',
}
url = "https://www.extratodebito.detran.pr.gov.br/detranextratos/geraExtrato.do?action=iniciarProcesso"
req = Request(url, headers=headers)
fetch(req)  # fetch() is provided by scrapy shell

I got a 200 response:

2020-09-14 11:16:03 [scrapy.core.engine] INFO: Spider opened
2020-09-14 11:16:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.extratodebito.detran.pr.gov.br/detranextratos/geraExtrato.do?action=iniciarProcesso> (referer: None)

