Download a PDF using urllib?

Question

I am trying to download a PDF file from a website using urllib. This is what I have so far:

import urllib

def download_file(download_url):
    web_file = urllib.urlopen(download_url)
    local_file = open('some_file.pdf', 'w')
    local_file.write(web_file.read())
    web_file.close()
    local_file.close()

if __name__ == 'main':
    download_file('http://www.example.com/some_file.pdf')

When I run this code, all I get is an empty PDF file. What am I doing wrong?

Answer 1

Here is an example that works:

import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    file = open("document.pdf", 'wb')
    file.write(response.read())
    file.close()
    print("Completed")

if __name__ == "__main__":
    main()

Answer 2

Change open('some_file.pdf', 'w') to open('some_file.pdf', 'wb'). PDF files are binary files, so you need the 'b'. The same is true of pretty much any file that you cannot open in a text editor.
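To see the effect of the 'b' flag in Python 3, here is a small sketch. The names SAMPLE, save_bytes, and demo are made up for illustration, and SAMPLE only stands in for the bytes that response.read() would return. In Python 3 a text-mode file rejects bytes with a TypeError, while a binary-mode file writes them unchanged.

```python
import os
import tempfile

# Stand-in for the bytes returned by response.read(); not a real PDF.
SAMPLE = b"%PDF-1.4\n\x00\xff binary payload\n"

def save_bytes(path, data, mode):
    """Write data to path using the given mode; return True on success."""
    try:
        with open(path, mode) as f:
            f.write(data)
        return True
    except TypeError:
        # Text mode ('w') only accepts str in Python 3, so bytes are rejected.
        return False

def demo():
    """Try both modes on a temporary file and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "some_file.pdf")
        text_ok = save_bytes(path, SAMPLE, "w")     # fails on bytes
        binary_ok = save_bytes(path, SAMPLE, "wb")  # succeeds
        with open(path, "rb") as f:
            round_trip = f.read() == SAMPLE
    return text_ok, binary_ok, round_trip
```

Note that the failed text-mode attempt still creates the file before the write is rejected, so an empty file can be left behind.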

Answer 3

Try urllib.request.urlretrieve (Python 3) and just do this:

from urllib.request import urlretrieve

def download_file(download_url):
    urlretrieve(download_url, 'path_to_save_plus_some_file.pdf')

if __name__ == '__main__':
    download_file('http://www.example.com/some_file.pdf')

Answer 4

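I tried the code above. It works fine in some cases, but for some websites that embed a PDF you may get an error like HTTPError: HTTP Error 403: Forbidden. Such websites have server-side security features that block known bots. By default urllib sends a User-Agent header that says something like Python-urllib/3.x. So I suggest adding a custom header through urllib's request module, as shown below.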

from urllib.request import Request, urlopen

url = "https://realpython.com/python-tricks-sample-pdf"

# Send a browser-like User-Agent instead of the default Python-urllib one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# Open the request (not the bare URL) so the custom header is actually sent
with urlopen(req) as response, open("<location to dump pdf>/<name of file>.pdf", "wb") as code:
    code.write(response.read())
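To check what urllib would send by default, and to confirm that a custom header is really attached, a minimal sketch follows. The function names default_user_agent, with_browser_agent, and make_opener are hypothetical helpers, not part of urllib. Nothing here touches the network.

```python
from urllib.request import Request, build_opener

def default_user_agent():
    """Return the User-Agent string a default urllib opener would send."""
    opener = build_opener()
    # addheaders is a list of (name, value) pairs set by the opener itself.
    return dict(opener.addheaders)["User-agent"]

def with_browser_agent(url, user_agent="Mozilla/5.0"):
    """Build a Request carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def make_opener(user_agent="Mozilla/5.0"):
    """Build an opener whose default headers carry the custom User-Agent."""
    opener = build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

If you prefer urlretrieve, passing the result of make_opener() to urllib.request.install_opener() should make later urlopen and urlretrieve calls use the same header.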

Answer 5

FYI: you can also use wget to download a PDF from a URL easily. urllib keeps changing between versions and often causes issues (at least for me).

import wget

wget.download(link)

Instead of entering the PDF link, you can also modify your code so that you enter a web page link and extract all the PDFs from it. Here is a guide: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
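As a rough idea of what extracting PDF links from a page can look like, here is a standard-library sketch. It is not the method from the linked guide. The names PdfLinkParser and find_pdf_links are invented for this example, and it assumes the PDFs are plain <a href> links whose path ends in .pdf.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)

def find_pdf_links(html, base_url):
    """Return all PDF links found in an HTML string, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

Each URL it returns can then be handed to whichever download function you already use.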

Answer 6

I suggest using the following lines of code:

import urllib.request
import shutil
url = "link to your website for pdf file to download"
output_file = "local directory://name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
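After any of these downloads it can help to check that what landed on disk is really a PDF and not an empty file or an HTML error page. The sketch below relies on the fact that PDF files begin with the bytes %PDF-. The function name looks_like_pdf is made up for this example.

```python
import os

def looks_like_pdf(path):
    """Return True if the file is non-empty and starts with the PDF signature."""
    if os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        # Every PDF file begins with the header "%PDF-" followed by a version.
        return f.read(5) == b"%PDF-"
```

For instance, calling looks_like_pdf(output_file) right after the copy gives a quick sanity check before opening the file in a viewer.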

6 answers

Answer 1
24 2014-07-19 21:57:49

Answer 2
12 2014-07-19 21:15:37

Answer 3
6 2018-02-05 05:52:53

Answer 4
4 2018-09-25 11:27:30

Answer 5
3 2020-12-24 09:27:37

Answer 6
1 2018-03-02 00:18:01
