Requests is unable to get a pdf URL and download it

For my job, we get a lot of product PDFs that we need to download. This leads to long lists of URLs that I'd rather not click through one by one. For some, I'm able to use the code below to download a PDF, but for others (like the one included), requests seems to hang indefinitely when I ask it to get the URL.

I've tried different parameters and various tips that I've seen elsewhere, and nothing has worked. I'm new to coding and to Python, so I'm probably missing something obvious here. Any help and explanation would be greatly appreciated. Thank you!

import requests  # to get the file from the web
import shutil  # to save it locally

url = "https://www.us.kohler.com/webassets/kpna/catalog/pdf/en/K-10411_spec_US-CA_Kohler_en.pdf"
filename = 'TEST-Image.pdf'

r = requests.get(url, stream=True)

if r.status_code == 200:
    # Let urllib3 undo any gzip/deflate content-encoding before copying.
    r.raw.decode_content = True

    # Copy the raw response stream straight into the local file.
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    print('PDF successfully downloaded:', filename)
else:
    print("PDF couldn't be retrieved")

The issue here, at least with the specific link provided, is that something on Kohler's side does not appreciate requests without a browser-style user-agent set in the headers (by default, requests identifies itself as python-requests). This is either a bug or intentional. It may actually be an attempt to prevent people from doing exactly what you're doing: mass downloading their manuals. Regardless, the solution is simple.
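For context, here is a small sketch of how to check what user-agent requests sends when you don't set one yourself. It only inspects local defaults and makes no network call; the variable names are illustrative.

```python
import requests

# The user-agent string requests uses when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/" followed by the installed version

# A new Session carries the same value in its default headers.
session = requests.Session()
session_ua = session.headers["User-Agent"]
print(session_ua)
```

A server that filters on this header can easily tell such a request apart from a browser.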

Modify your requests call to look like this:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
r = requests.get(url, stream = True, headers = headers)

Note that the user-agent string provided is just a standard string for Chrome on Windows 10. You could probably use any user-agent string you wanted.
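Since the goal is to work through a long list of URLs, here is a minimal sketch of how the fix might be wrapped up for batch use. The helper names (`filename_from_url`, `download_pdf`, `download_all`), the 30-second timeout, and the 8192-byte chunk size are my own assumptions, not anything the site requires. The `timeout` makes a stalled request raise an error instead of hanging forever; note that in requests it limits the wait for a connection and between received bytes, not the total download time.

```python
import os
from urllib.parse import urlsplit

import requests

# Assumption: any common browser-style user-agent is accepted by the server.
HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
    )
}


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download_pdf(url, dest, session, timeout=30):
    """Stream one PDF to dest; return True on success, False on a requests error."""
    try:
        with session.get(url, headers=HEADERS, stream=True, timeout=timeout) as r:
            r.raise_for_status()  # treat 4xx/5xx responses as failures
            with open(dest, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
    except requests.RequestException as exc:
        # A partially written file may be left behind if the stream fails midway.
        print(f"Could not retrieve {url}: {exc}")
        return False
    return True


def download_all(urls):
    """Download each URL into the current directory, reusing one session."""
    with requests.Session() as session:
        return {
            url: download_pdf(url, filename_from_url(url), session)
            for url in urls
        }
```

Calling `download_all` with your list of URLs returns a dict mapping each URL to `True` or `False`, so you can see which downloads need a second look.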
