
Downloading files from URL in Python

I have a CSV file with 67,000 rows of URLs. Each URL downloads another dataset in a format such as CSV, HWP, or ZIP.

This is the code I have written:

import cgi
import requests


SAVE_DIR = 'C:/dataset'

def downloadURLResource(url):
    r = requests.get(url.rstrip(), stream=True)
    if r.status_code == 200:
        targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
        with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
        return targetFileName


with open('C:/urlcsv.csv') as f:
    print(list(map(downloadURLResource, f.readlines())))

This code worked fine until it reached the 203rd URL.

When I checked in the shell, this URL had no content-disposition header, which caused an error.

Moreover, it downloaded the 201st and 202nd URLs, but when I checked SAVE_DIR there were only 200 files in total, which means 2 files were missing.

My questions are:

(1) How do I know which files were not downloaded without manually checking the names of the downloaded files against the URLs? (No error was shown in the Python shell; the files were just skipped.)

(2) How can I fix my code to print the names of the files or URLs that were not downloaded? (Both the ones that were skipped silently, with no error shown in the shell, and the ones that stopped the script with an error.)


This is the error that stopped me from downloading:

Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 30, in downloadURLResource
    targetFileName = requests.utils.unquote(cgi.parse_header(r.headers['content-disposition'])[1]['filename'])
  File "C:\Python34\lib\site-packages\requests\structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-disposition'

url http://www.data.go.kr/dataset/fileDownload.do?atchFileId=FILE_000000001210727&fileDetailSn=1&publicDataDetailPk=uddi:4cf4dc4c-e0e9-4aee-929e-b2a0431bf03e had no content-disposition header

Traceback (most recent call last):
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 46, in <module>
    print(list(map(downloadURLResource, f.readlines())))
  File "C:\Users\pc\Desktop\오목눈이\URL다운.py", line 38, in downloadURLResource
    return targetFileName
UnboundLocalError: local variable 'targetFileName' referenced before assignment

You are filtering out anything that isn't a 200 response with if r.status_code == 200:, so those URLs are skipped without any message.

Exactly what you should do depends on how the server responds when a file isn't there, but assuming it returns a 404, you could try something like:

def downloadURLResource(url):
    url = url.rstrip()  # drop the trailing newline read from the CSV
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        content_disposition = r.headers.get('content-disposition')
        if content_disposition is not None:
            targetFileName = requests.utils.unquote(cgi.parse_header(content_disposition)[1]['filename'])
            with open("{}/{}".format(SAVE_DIR, targetFileName), 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024):
                    f.write(chunk)
            return targetFileName
        else:
            print('url {} had no content-disposition header'.format(url))
    elif r.status_code == 404:
        print('{} returned a 404, no file was downloaded'.format(url))
    else:
        print('something else went wrong with {}'.format(url))
    # every branch that did not save a file falls through and returns None
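
To get a list of what was not downloaded instead of scanning printed messages, the calls could be wrapped in a loop that records each outcome. This is only a rough sketch: download_all and overwritten_names are hypothetical helper names, and it assumes the download function returns the saved file name on success and None otherwise. The duplicate-name check is a guess at why 2 files were missing from SAVE_DIR (two URLs returning the same file name would overwrite each other); the question does not confirm that cause.

```python
from collections import Counter


def download_all(urls, download):
    """Call download(url) for every URL and sort the outcomes.

    `download` is any callable that returns the saved file name on
    success and None on failure (an assumption of this sketch).
    """
    saved = []   # (url, file name) pairs that were written to disk
    failed = []  # (url, reason) pairs that were not
    for line in urls:
        url = line.strip()
        if not url:
            continue  # ignore blank lines in the input
        try:
            name = download(url)
        except Exception as exc:  # e.g. a connection error or a missing header
            failed.append((url, repr(exc)))
            continue
        if name is None:
            failed.append((url, 'no file saved'))
        else:
            saved.append((url, name))
    return saved, failed


def overwritten_names(saved):
    """Return file names saved more than once; later copies replace earlier ones."""
    counts = Counter(name for _, name in saved)
    return sorted(name for name, n in counts.items() if n > 1)


# Possible usage with the function and file from the question:
#     with open('C:/urlcsv.csv') as f:
#         saved, failed = download_all(f, downloadURLResource)
#     for url, reason in failed:
#         print(url, reason)
#     print(overwritten_names(saved))
```

Because exceptions are caught per URL, one bad URL no longer stops the remaining downloads, and the failed list covers both the silently skipped URLs and the ones that used to raise.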

Your question is not easily reproducible, so it is hard for others to test. Consider adding some of the URLs that caused problems to the question.
