urllib.request.urlretrieve returns a corrupt file (how to handle this kind of URL?)

Question

I want to download about 1000 PDF files from a web page, but I ran into an awkward PDF URL format. Neither requests.get() nor urllib.request.urlretrieve() works for me.

A usual PDF URL looks like:

https://webpage.com/this_file.pdf

But this URL looks like:

https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9000001&fileSn=1&wrtFileTy=01

So it doesn't have .pdf in the URL. If you click on it in a browser, the file downloads fine, but with Python's urllib you get a corrupt file.

At first I thought it was being redirected to some other URL, so I tried requests.get(url, allow_redirects=True), but the resulting URL was the same as before.

import urllib.request

filename = './novel/pdf1.pdf'
url = 'https://gongu.copyright.or.kr/gongu/wrt/cmmn/wrtFileDownload.do?wrtSn=9031938&fileSn=1&wrtFileTy=01'

urllib.request.urlretrieve(url, filename)

This code downloads a corrupt PDF file.
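When a download is reported as corrupt, it can help to look at what was actually saved before changing libraries. The sketch below is only a diagnostic aid, not part of the original post: it checks whether a file starts with the `%PDF-` signature that a well-formed PDF begins with, and returns the first bytes so an HTML error page or similar is easy to spot. The helper names `looks_like_pdf` and `preview` are hypothetical.

```python
PDF_MAGIC = b"%PDF-"  # a well-formed PDF file starts with this signature


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def preview(path, n=200):
    """Return the first `n` bytes of the file, decoded leniently for inspection."""
    with open(path, "rb") as f:
        return f.read(n).decode("utf-8", errors="replace")
```

Calling `looks_like_pdf('./novel/pdf1.pdf')` on the saved file, and printing `preview(...)` if it returns False, shows whether the server sent something other than a PDF.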

Answer 1

I solved it by using the content attribute of the response object returned by requests.get().


import requests

filename = './novel1/pdf1.pdf'
url = . . .

# Fetch the file and write the raw response bytes to disk.
response = requests.get(url)
with open(filename, 'wb') as f:
    f.write(response.content)

I referred to this Q&A: Download and save PDF file with Python requests module
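Since the question involves about 1000 files, it may be worth wrapping the approach from the answer in a small function that refuses to save anything that is not a PDF. This is a minimal sketch under stated assumptions, not the answerer's code: `download_pdf` is a hypothetical helper, the 30-second timeout is an arbitrary choice, and the `get` parameter exists only so the function can be tried without a network connection.

```python
import requests

PDF_MAGIC = b"%PDF-"  # a well-formed PDF file starts with this signature


def download_pdf(url, path, get=requests.get, timeout=30):
    """Fetch `url` and write the body to `path` only if it looks like a PDF.

    Returns the number of bytes written. `get` defaults to requests.get and
    can be swapped for a stand-in when no network is available.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    body = response.content      # raw bytes, as used in the answer
    if not body.startswith(PDF_MAGIC):
        raise ValueError(f"{url} did not return a PDF (starts with {body[:20]!r})")
    with open(path, "wb") as f:
        f.write(body)
    return len(body)
```

A call such as `download_pdf(url, './novel/pdf1.pdf')` with the URL from the question would then either save the file or raise, so a failed download cannot pass silently in a long loop.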

Solution 1
0 2019-05-05 09:12:03
