How to download a file using Python in a 'smarter' way?
I need to download several files over HTTP in Python.
The most obvious way to do it is to use urllib2:
import urllib2
u = urllib2.urlopen('http://server.com/file.html')
localFile = open('file.html', 'w')
localFile.write(u.read())
localFile.close()
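But I have to deal with URLs that are nasty in some way, for example: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf
When downloaded through a browser, the file has a human-readable name, i.e. accounts.pdf.
Is there a way to handle that in Python, so that I don't need to know the file names and hard-code them into my script?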
Download scripts like that tend to send a header telling the user agent what to name the file:
Content-Disposition: attachment; filename="the filename.ext"
If you can grab that header, you can get the proper file name.
There is another thread with a little bit of code for grabbing Content-Disposition:
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
remotefile.info()['Content-Disposition']
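Splitting the header value by hand is easy to get wrong (quoting, extra parameters). As a minimal Python 3 sketch, the standard library's email.message.Message can parse the value instead. The helper name filename_from_header is made up for illustration, and the header string would normally come from the response's info().

```python
from email.message import Message

def filename_from_header(content_disposition):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = content_disposition
    # get_filename() removes surrounding quotes and returns None
    # when the header carries no filename parameter.
    return msg.get_filename()

print(filename_from_header('attachment; filename="the filename.ext"'))
```

This only parses a string, so it can be tried without any network access.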
Based on the comments and @Oli's answer, I came up with a solution like this:
import urllib2
from os.path import basename
from urlparse import urlsplit

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, localFileName = None):
    localName = url2name(url)
    req = urllib2.Request(url)
    r = urllib2.urlopen(req)
    if r.info().has_key('Content-Disposition'):
        # If the response has Content-Disposition, we take file name from it
        localName = r.info()['Content-Disposition'].split('filename=')[1]
        if localName[0] == '"' or localName[0] == "'":
            localName = localName[1:-1]
    elif r.url != url:
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    if localFileName:
        # we can force to save the file as specified name
        localName = localFileName
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()
It takes the file name from Content-Disposition; if that header is not present, it uses the file name from the URL (if a redirect happened, the final URL is used).
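The order of precedence described here can be sketched as a small pure function in Python 3. The name pick_local_name and its parameters are hypothetical; header_name stands for whatever was extracted from Content-Disposition, and final_url for the URL after any redirects.

```python
from os.path import basename
from urllib.parse import urlsplit

def pick_local_name(final_url, header_name=None, forced_name=None):
    """Choose a local file name: forced name, then header name, then URL."""
    if forced_name:
        return forced_name      # an explicitly requested name always wins
    if header_name:
        return header_name      # name announced via Content-Disposition
    # Otherwise fall back to the last path segment of the final URL.
    return basename(urlsplit(final_url).path)

nasty = 'http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf'
print(pick_local_name(nasty))                              # somemore
print(pick_local_name(nasty, header_name='accounts.pdf'))  # accounts.pdf
```

For the URL from the question, the URL fallback alone yields only "somemore", which is why the header is consulted first.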
Combining much of the above, here is a more Pythonic solution:
import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            # (split on the first '=' only, so values containing '=' still form pairs)
            cd = dict(map(
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()
To Kender:
if localName[0] == '"' or localName[0] == "'":
    localName = localName[1:-1]
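This is not safe: a web server can pass a malformed name such as ["file.ext] or [file.ext'], or even an empty one, in which case localName[0] raises an exception. Correct code could look like this: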
localName = localName.replace('"', '').replace("'", "")
if localName == '':
    localName = SOME_DEFAULT_FILE_NAME
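To see how this behaves on the malformed inputs mentioned, here is a small self-contained sketch. The function name clean_name and the default 'download.bin' are placeholders chosen for illustration (the latter stands in for SOME_DEFAULT_FILE_NAME).

```python
def clean_name(raw_name, default_name='download.bin'):
    """Remove quote characters from a header-supplied name; fall back if empty."""
    name = raw_name.replace('"', '').replace("'", '').strip()
    return name or default_name

# Well-formed, half-quoted, and empty inputs all yield a usable name.
for raw in ['"file.ext"', '"file.ext', "file.ext'", '']:
    print(repr(raw), '->', clean_name(raw))
```

Unlike indexing localName[0], this never raises on an empty string, and an unbalanced quote does not cost the name its last character.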
Using wget:
import wget
custom_file_name = "/custom/path/custom_name.ext"
wget.download(url, custom_file_name)
Using urlretrieve:
import urllib
urllib.urlretrieve(url, custom_file_name)
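Note that urlretrieve does not create missing directories: it simply opens the target path for writing, so the directory that will hold custom_file_name has to exist already.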
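If the target directory might be missing, a defensive option is to create it explicitly before downloading rather than relying on the download call to do so. The following is a minimal Python 3 sketch (where the function lives in urllib.request); the helper name retrieve_to is made up for illustration.

```python
import os
from urllib.request import urlretrieve

def retrieve_to(url, target_path):
    """Download url to target_path, creating parent directories first."""
    parent = os.path.dirname(target_path)
    if parent:
        # exist_ok=True makes this a no-op when the directory is already there
        os.makedirs(parent, exist_ok=True)
    local_path, _headers = urlretrieve(url, target_path)
    return local_path

# Hypothetical usage:
# retrieve_to('http://example.com/somefile.zip', '/custom/path/custom_name.ext')
```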
You need to look at the "Content-Disposition" header; see kender's solution.
Posting a modified solution, with the ability to specify an output folder:
from os.path import basename
import os
from urllib.parse import urlsplit
import urllib.request

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, out_path):
    localName = url2name(url)
    req = urllib.request.Request(url)
    r = urllib.request.urlopen(req)
    # has_key() does not exist in Python 3; use the "in" operator instead
    if 'Content-Disposition' in r.info():
        # If the response has Content-Disposition, we take file name from it
        localName = r.info()['Content-Disposition'].split('filename=')[1]
        if localName[0] == '"' or localName[0] == "'":
            localName = localName[1:-1]
    elif r.url != url:
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    localName = os.path.join(out_path, localName)
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()

download("https://example.com/demofile", '/home/username/tmp')
I just updated kender's answer for Python 3.