How do I download a file with Python in a "smarter" way?

Question

I need to download several files over HTTP in Python.

The most obvious way to do it is with urllib2:

import urllib2
u = urllib2.urlopen('http://server.com/file.html')
localFile = open('file.html', 'wb')  # binary mode, so the bytes are written unchanged
localFile.write(u.read())
localFile.close()

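But I have to deal with URLs that are nasty in some way, for example: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When the file is downloaded through a browser, it has a human-readable name, e.g. accounts.pdf.

Is there a way to handle that in Python, so that I don't need to know the file names and hard-code them into my script?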

Answer 1

Download scripts like that tend to send a header that tells the user agent what to name the file:

Content-Disposition: attachment; filename="the filename.ext"

If you can grab that header, you can get the proper file name.

There is another thread with a bit of code that can be used to grab Content-Disposition.

import urllib2

remotefile = urllib2.urlopen('http://example.com/somefile.zip')
remotefile.info()['Content-Disposition']  # raw header value; raises KeyError if the header is absent
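The raw header still has to be parsed to get at the name. Here is a minimal sketch of one way to do that, assuming Python 3 and the standard library's email.message.Message; the helper name filename_from_header is made up for illustration and is not part of the answer above.

```python
from email.message import Message


def filename_from_header(content_disposition):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not content_disposition:
        return None
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    # get_filename() understands the "name=value" parameter syntax and
    # removes surrounding double quotes.
    return msg.get_filename()


print(filename_from_header('attachment; filename="the filename.ext"'))
# the filename.ext
```

Letting the standard library parse the parameters avoids hand-written string splitting, which tends to break on quoted or oddly spaced values.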

Answer 2

Based on the comments and @Oli's answer, I came up with this solution:

import urllib2
from os.path import basename
from urlparse import urlsplit

def url2name(url):
    return basename(urlsplit(url)[2])

def download(url, localFileName = None):
    localName = url2name(url)
    req = urllib2.Request(url)
    r = urllib2.urlopen(req)
    if r.info().has_key('Content-Disposition'):
        # If the response has Content-Disposition, we take file name from it
        localName = r.info()['Content-Disposition'].split('filename=')[1]
        if localName[0] == '"' or localName[0] == "'":
            localName = localName[1:-1]
    elif r.url != url: 
        # if we were redirected, the real file name we take from the final URL
        localName = url2name(r.url)
    if localFileName: 
        # we can force to save the file as specified name
        localName = localFileName
    f = open(localName, 'wb')
    f.write(r.read())
    f.close()

It takes the file name from Content-Disposition; if that header is not present, it uses the file name from the URL (if a redirect occurred, the final URL is used).
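The order of preference described here can be isolated in a pure function, which makes it easy to check without touching the network. This is only a sketch, assuming Python 3; choose_name and its parameters are hypothetical names, and it takes the final URL only, since that equals the requested URL when no redirect happened.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def url2name(url):
    """Last path segment of the URL, with percent-escapes decoded."""
    return posixpath.basename(unquote(urlsplit(url).path))


def choose_name(final_url, content_disposition=None, forced_name=None):
    """Pick a local file name: forced name, then header, then URL."""
    if forced_name:
        return forced_name
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
        if name:
            return name
    # No usable header: fall back to the URL after any redirects.
    return url2name(final_url)


odd_url = "http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf"
print(choose_name(odd_url, 'attachment; filename="accounts.pdf"'))
print(choose_name(odd_url))
```

With the header present the first call yields accounts.pdf; without it the function can only offer the last path segment of the URL, which for this kind of address is rarely a useful name.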

Answer 3

Combining much of the above, here is a more Pythonic solution:

import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            # If the response has Content-Disposition, try to get filename from it
            cd = dict(map(
                # split on the first '=' only, so a file name containing '=' still yields a (key, value) pair
                lambda x: x.strip().split('=', 1) if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        # if no filename was found above, parse it out of the final URL.
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()
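The code in this thread targets Python 2, where urllib2 and urlparse exist. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes urllib.request and urllib.parse as the replacements and relies on response.headers being an email.message.Message subclass; it is not the answer author's code.

```python
import posixpath
import shutil
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def download(url, file_name=None):
    """Stream url to disk and return the path that was written."""
    with urlopen(url) as response:
        if not file_name:
            # Prefer the name announced in Content-Disposition, if any.
            file_name = response.headers.get_filename()
        if not file_name:
            # Otherwise use the last path segment of the final URL.
            file_name = posixpath.basename(unquote(urlsplit(response.url).path))
        with open(file_name, "wb") as f:
            # copyfileobj copies in chunks, so the whole body is never
            # held in memory at once.
            shutil.copyfileobj(response, f)
    return file_name
```

A call such as download('http://example.com/somefile.zip') would then save the file under whatever name the server announces, or somefile.zip if it announces none. A server-supplied name should not be trusted blindly; it may contain directory parts or be empty.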

Answer 4

To Kender:

if localName[0] == '"' or localName[0] == "'":
    localName = localName[1:-1]

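This is not safe: a web server can pass a badly formatted name such as ["file.ext] or [file.ext'], or even an empty one, in which case localName[0] raises an exception. Correct code looks like this: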

localName = localName.replace('"', '').replace("'", "")
if localName == '':
    localName = SOME_DEFAULT_FILE_NAME
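The same idea can be wrapped in a small helper. The sketch below is an illustration only: clean_name and DEFAULT_NAME are hypothetical names, and dropping the directory part is an extra precaution that the snippet above does not include.

```python
import posixpath

DEFAULT_NAME = "download.bin"  # hypothetical fallback, stands in for SOME_DEFAULT_FILE_NAME


def clean_name(raw, default=DEFAULT_NAME):
    """Strip quotes from a server-supplied name and fall back to a default."""
    # Remove quote characters wherever they appear, balanced or not.
    name = raw.replace('"', "").replace("'", "").strip()
    # Extra precaution (an assumption, not in the answer): keep only the
    # last path component so the name cannot point outside the target folder.
    name = posixpath.basename(name.replace("\\", "/"))
    if name in ("", ".", ".."):
        return default
    return name


print(clean_name('"file.ext'))   # file.ext
print(clean_name(""))            # download.bin
```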

Answer 5

Using wget:

import wget  # third-party package, not part of the standard library

custom_file_name = "/custom/path/custom_name.ext"
wget.download(url, custom_file_name)

Using urlretrieve: