HTTP Error 403: Forbidden with urlretrieve

Question

I am trying to download a PDF, but I get the following error: HTTP Error 403: Forbidden

I am aware that the server is blocking the request for whatever reason, but I can't seem to find a solution.

import urllib.request
import urllib.parse
import requests


def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)


try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf')

    print('initialized')

    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }

    print('HDR recieved')

    req = urllib.request.Request(url, headers=hdr)

    print('Header sent')

    resp = urllib.request.urlopen(req)

    print('Request sent')

    respData = resp.read()

    download_pdf(url)

    print('Complete')

except Exception as e:
    print(str(e))

Answer 1

You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. In your code the headers are attached to the Request passed to urlopen(), but download_pdf() then calls urlretrieve() on the bare URL, so that second request goes out with urllib's default User-Agent. urllib.request.urlretrieve() has no parameter for setting HTTP headers; however, you can use urllib.request.URLopener.retrieve():

import urllib.request

opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')

NB: You are using Python 3, where these functions are considered part of the "Legacy interface", and URLopener has been deprecated. For that reason you should not use them in new code.
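For new code, the same effect can be had without the legacy interface by passing a Request with custom headers to urlopen() and writing the body to disk yourself. This is only a minimal sketch, assuming the server looks at nothing but the User-Agent header; USER_AGENT, build_request and download are illustrative names, not something from the question or from urllib:

```python
import shutil
import urllib.request

# Hypothetical placeholder value; use whatever string the server accepts.
USER_AGENT = "Mozilla/5.0 (compatible; example-downloader)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, user_agent=USER_AGENT):
    """Stream the response body for url into the file dest and return dest."""
    req = build_request(url, user_agent)
    # urlopen() raises urllib.error.HTTPError for a 403, so a refusal
    # by the server is still reported rather than silently ignored.
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

With the URL from the question this would be called as download(url, 'Test.pdf'). Whether a given User-Agent value is accepted is up to the server.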

The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests, but you don't use it; you should, though, because it is much easier than urllib. This works for me:

import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
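If the server ever does reject the default requests User-Agent, headers can be passed to requests.get() directly, and raise_for_status() turns a 403 into an exception instead of writing an error page to disk. A rough sketch along those lines; save_response, its session parameter and the 30-second timeout are my own illustrative choices, not part of requests:

```python
import requests


def save_response(url, dest, headers=None, session=None, timeout=30):
    """Download url to dest, raising on HTTP errors such as 403."""
    # Use the given session if there is one, else the module-level API.
    http = session or requests
    resp = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    resp.raise_for_status()
    with open(dest, "wb") as outfile:
        outfile.write(resp.content)
    return dest
```

For example: save_response(url, '0580_s03_qp_1.pdf', headers={'User-Agent': 'whatever'}).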


Solution 1
7 · Accepted · 2016-01-22 23:50:04
