
requests: cannot download zip file properly with verify is False

I am using Python 2.7.3 and requests (requests==2.10.0). I am trying to download a zip file from a link. The website's certificate cannot be verified, but I just want to download that zip, so I used verify=False :

link = 'https://webapi.yanoshin.jp/rde.php?https%3A%2F%2Fdisclosure.edinet-fsa.go.jp%2FE01EW%2Fdownload%3Fuji.verb%3DW0EZA104CXP001006BLogic%26uji.bean%3Dee.bean.parent.EECommonSearchBean%26lgKbn%3D2%26no%3DS1007NMV'
r = requests.get(link, timeout=10, verify=False)
print r.content
# 'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff!\xf9\x04\x01\x00\x00\x01\x00,\x00\x00\x00\x00\x01\x00\x01\x00@\x02\x02L\x01\x00;'
print r.headers
# {'Content-Length': '43', 'Via': '1.0 localhost (squid/3.1.19)', 'X-Cache': 'MISS from localhost', 'X-Cache-Lookup': 'MISS from localhost:3128', 'Server': 'Apache', 'Connection': 'keep-alive', 'Date': 'Mon, 06 Jun 2016 07:59:52 GMT', 'Content-Type': 'image/gif'}

However, I tried with Firefox & Chromium: if I choose to trust that cert, I am able to download the zip file. wget --no-check-certificate [that link] produces a zip file with the correct size as well.

(I wrote that gif to disk and checked: no real content, and far too small in terms of file size.)
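A quick way to catch this kind of failure in a script is to check the file's magic bytes instead of trusting the Content-Type header: ZIP archives start with PK\x03\x04, while the 43-byte response above starts with GIF89a (a 1x1 tracking pixel). A minimal sketch (the helper name is mine):

```python
def looks_like_zip(data):
    """Return True if the bytes begin with the ZIP local-file-header magic."""
    return data[:4] == b'PK\x03\x04'

# The body returned by the server is actually a tiny GIF, not a zip:
gif_body = b'GIF89a\x01\x00\x01\x00\x80\x00\x00'
print(looks_like_zip(gif_body))            # False
print(looks_like_zip(b'PK\x03\x04rest'))   # True
```

In the question's code this would be `looks_like_zip(r.content)` before writing anything to disk.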

Maybe it is a header issue? I do not know. I can use wget, of course; I just want to figure out the reason behind this and make it work.

(The browser downloads a zip of about 23.4 KB; wget [link] -O test.zip downloads the zip file as well.)

The server is trying to block scripts from downloading ZIP files; you'll see the same issue when using curl :

$ curl -sD - -o /dev/null "https://webapi.yanoshin.jp/rde.php?https%3A%2F%2Fdisclosure.edinet-fsa.go.jp%2FE01EW%2Fdownload%3Fuji.verb%3DW0EZA104CXP001006BLogic%26uji.bean%3Dee.bean.parent.EECommonSearchBean%26lgKbn%3D2%26no%3DS1007NUS"
HTTP/1.1 302 Found
Server: nginx
Date: Mon, 06 Jun 2016 08:56:20 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/7.0.7
Location: https://disclosure.edinet-fsa.go.jp/E01EW/download?uji.verb=W0EZA104CXP001006BLogic&uji.bean=ee.bean.parent.EECommonSearchBean&lgKbn=2&no=S1007NUS

Notice the text/html response.

The server seems to be looking for browser-specific Accept and User-Agent headers; copying the Accept header Chrome sends, plus adding a minimal User-Agent string, seems to be enough to fool the server:

>>> r = requests.get(link, timeout=10, headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}, verify=False)
# ... two warnings about ignoring the certificate ...
>>> r.headers
{'Content-Length': '14078', 'Content-Disposition': 'inline;filename="Xbrl_Search_20160606_175759.zip"', 'Set-Cookie': 'FJNADDSPID=3XWzlS; expires=Mon, 05-Sep-2016 08:57:59 GMT; path=/; secure, JSESSIONID=6HIMAP1I60PJ2P9HC5H3AC1N68PJAOR568RJIEB5CGS3I0UITOI5A08000P00000.E01EW_001; Path=/E01EW; secure', 'Connection': 'close', 'X-UA-Compatible': 'IE=EmulateIE9', 'Date': 'Mon, 06 Jun 2016 08:57:59 GMT', 'Content-Type': 'application/octet-stream'}
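With the right headers, the response carries a Content-Disposition header naming the attachment, which is a more reliable filename source than the URL here. A small helper to pull it out (the regex and function name are mine; it handles only the simple quoted `filename="..."` form shown above, not RFC 5987 `filename*=` encoding):

```python
import re

def filename_from_disposition(value, fallback='download.zip'):
    """Extract the quoted filename from a Content-Disposition header value."""
    match = re.search(r'filename="([^"]+)"', value)
    return match.group(1) if match else fallback

disposition = 'inline;filename="Xbrl_Search_20160606_175759.zip"'
print(filename_from_disposition(disposition))
# Xbrl_Search_20160606_175759.zip

# Saving the body once you have the response object r:
# with open(filename_from_disposition(r.headers['Content-Disposition']), 'wb') as f:
#     f.write(r.content)
```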
