
Downloading a large number of files using Python

test.txt contains the list of files to be downloaded:

http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif

How can these files be downloaded using Python with maximum download speed?

My thinking was as follows:

import urllib.request
with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)

What comes after that? How do I select the download directory?

Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the file for writing in binary mode (wb), since response.read() returns bytes, not str.

import os
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with open(output_file, 'wb') as writer:
            writer.write(response.read())

Note:

Downloading multiple files can be faster if you use multiple threads, since a single download rarely uses the full bandwidth of your internet connection.
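A minimal sketch of that threaded approach, assuming the same test.txt and output_dir as above (the download_one helper and the choice of 5 workers are my own illustration, not part of the original answer):

import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

output_dir = 'path/to/your/output/dir'

def download_one(url):
    # Derive the local filename from the last part of the URL.
    output_file = os.path.join(output_dir, url.split('/')[-1])
    response = urllib.request.urlopen(url)
    with open(output_file, 'wb') as writer:
        writer.write(response.read())
    return output_file

with open('test.txt', 'r') as f:
    urls = f.read().splitlines()

# A few worker threads are usually enough to saturate the connection.
with ThreadPoolExecutor(max_workers=5) as pool:
    for path in pool.map(download_one, urls):
        print('Downloaded', path)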

Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As @Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).

I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5MB), since the default value of 16 KB is really small and will just slow things down.
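Applied to the loop above, the streaming version would look roughly like this (the only changes are the shutil import and the copyfileobj call with the 5 MB buffer):

import os
import shutil
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with open(output_file, 'wb') as writer:
            # Copy in 5 MB chunks instead of reading the whole file into memory.
            shutil.copyfileobj(response, writer, 5*1024*1024)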

This works fine for me (note that fileName must be just the bare file name, for example 'afaf1.tif'):

import os
import urllib

def download(baseUrl, fileName, layer=0):
    print 'Trying to download file:', fileName
    url = baseUrl + fileName
    name = os.path.join('foldertodownload', fileName)
    try:
        # Note that the folder needs to exist
        urllib.urlretrieve(url, name)
    except:
        # Upon failure, retries up to 5 times in total
        print 'Download failed'
        print 'Could not download file:', fileName
        if layer > 4:
            return
        else:
            layer += 1
        print 'retrying', str(layer) + '/5'
        download(baseUrl, fileName, layer)
    print fileName + ' downloaded'

# url is the base URL (e.g. 'http://example.com/example/') and nameList
# holds the bare file names ('afaf1.tif', 'afaf2.tif', ...)
for fileName in nameList:
    download(url, fileName)

Moved unnecessary code out of the try block.
