Downloading files from URLs in Python

Question

I have a CSV file with 67,000 rows of URLs. Each URL leads to a download of another dataset in a format such as CSV, HWP, or ZIP.

This is the code I wrote:

import cgi
import requests


SAVE_DIR = 'C:/dataset'

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName


with open('C:/urlcsv.csv') as f:
    print(list(map(downloadURLResource, f.readlines())))

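This code works fine until it reaches the 203rd URL.

When I checked in the shell, that URL had no Content-Disposition header, which caused the error.

Also, it downloaded URLs 201 and 202, but when I checked SAVE_DIR there were only 200 files in total, which means 2 files are missing.

My questions are:

(1) How can I find out which files were not downloaded without manually comparing the downloaded file names against the URLs? (No error was shown in the Python shell; the files were just skipped.)

(2) How can I fix the code so that it prints the names or URLs of the files that were not downloaded? (This covers both the files that were skipped silently, without stopping or showing an error in the shell, and the ones that stopped with an error in the shell.)

This is the error that stopped the download: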

Traceback (most recent call last):

  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in <module>

   print(list(map(downloadURLResource, f.readlines())))

  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 30, in downloadURLResource
    targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])

  File "C:\Python34\lib\site-packages\requests\structures.py", line 54, in __getitem__

   return self._store[key.lower()][1]

KeyError: 'content-disposition'

The URL http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e has no Content-Disposition header.

Traceback (most recent call last):

 File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 46, in <module>

  print(list(map(downloadURLResource, f.readlines())))

 File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in downloadURLResource

  return targetFileName

UnboundLocalError: local variable 'targetFileName' referenced before assignment

Answer 1

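With the check if r.status_code == 200: you filter out every response that is not a 200, and nothing is printed for those URLs.

Exactly what you should do depends on how the server responds when a file does not exist, but assuming it returns a 404, you could try something like this: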

import cgi
import requests

SAVE_DIR = 'C:/dataset'

def downloadURLResource(url):
    url = url.rstrip()
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        # .get() returns None instead of raising KeyError when the header is absent
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is not None:
            targetFileName = requests.utils.unquote(cgi.parse_header(content_disposition)[1]['filename'])
            with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
            return targetFileName
        else:
            print('url {} had no content-disposition header'.format(url))
    elif r.status_code == 404:
        print('{} returned a 404, no file was downloaded'.format(url))
    else:
        print('something else went wrong with {}'.format(url))
    # Every failure branch falls through to here, so the caller gets None
    return None
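If you would rather save the file than skip it when Content-Disposition is missing, one option is to fall back to the last path segment of the URL. The following is only a minimal sketch under that assumption: filename_from_response and its default argument are hypothetical names that are not part of your code, and it parses the header with email.message.Message from the standard library instead of cgi, because the cgi module was removed in Python 3.13.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_response(headers, url, default="download.bin"):
    """Pick a file name from Content-Disposition, else from the URL path.

    `headers` can be any mapping of header names to values; keys are
    lower-cased here so a plain dict works as well as requests' headers.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("content-disposition")
    if value:
        # Let the stdlib email parser extract the filename parameter.
        msg = Message()
        msg["content-disposition"] = value
        name = msg.get_filename()
        if name:
            # Percent-decode and drop any directory part sent by the server.
            name = unquote(name).replace("\\", "/")
            name = posixpath.basename(name)
            if name:
                return name
    # Fallback (an assumption, not the server's intent): last URL path segment.
    name = posixpath.basename(unquote(urlsplit(url).path))
    return name or default
```

Note that for URLs like the data.go.kr one in the question, the fallback name would be fileDownload.do for every such URL, so you would still need to make the names unique before writing to disk.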

Your problem is not easy to reproduce, so it is hard for others to test. Consider adding some of the URLs that cause the problem.
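To get at questions (1) and (2) without checking names by hand, you could record the outcome for every line of the URL file instead of mapping and printing. This is a rough sketch, not a definitive implementation: download_all and overwritten_names are hypothetical helper names, and download stands for any function that returns the saved file name, returns None when nothing was saved, or raises. The second helper rests on a guess that cannot be confirmed from the question: two URLs that return the same file name would overwrite each other, which could explain 202 downloads but only 200 files.

```python
from collections import Counter


def download_all(urls, download):
    """Call download(url) for each line and split results into saved/failed.

    Returns two lists of (line_number, url, detail) tuples, where detail is
    the saved file name for successes and the reason for failures.
    """
    saved, failed = [], []
    for line_no, raw in enumerate(urls, start=1):
        url = raw.strip()
        if not url:
            continue  # skip blank lines but keep the line numbering
        try:
            name = download(url)
        except Exception as exc:  # e.g. KeyError for a missing header
            failed.append((line_no, url, repr(exc)))
            continue
        if name is None:
            failed.append((line_no, url, "no file saved"))
        else:
            saved.append((line_no, url, name))
    return saved, failed


def overwritten_names(saved):
    """Return {file_name: count} for names that were written more than once."""
    counts = Counter(name for _, _, name in saved)
    return {name: n for name, n in counts.items() if n > 1}


# Possible usage with the function from the question:
# with open('C:/urlcsv.csv') as f:
#     saved, failed = download_all(f, downloadURLResource)
# for line_no, url, reason in failed:
#     print(line_no, url, reason)
# print(overwritten_names(saved))
```

Because exceptions are caught per URL, one bad URL no longer stops the run at line 203, and the failed list tells you exactly which lines to retry.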

Answer 1
1, accepted, 2017-07-20 08:47:35
