
Downloading a large number of files using Python

test.txt contains the list of files to be downloaded:

http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif

How can these files be downloaded using Python with maximum download speed?

My thinking was as follows:

import urllib.request
with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)

What comes after that? How do I select the download directory?

Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the file for writing in binary mode (wb), since response.read() returns bytes, not str.

import os
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with open(output_file, 'wb') as writer:
            writer.write(response.read())

Note:

Downloading multiple files can be faster if you use multiple threads, since a single download rarely uses the full bandwidth of your internet connection.
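A minimal sketch of that threaded approach, assuming the same test.txt and output_dir as above (the download_one helper and the choice of 5 workers are my own illustration, not part of the original answer):

import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

output_dir = 'path/to/your/output/dir'

def download_one(url):
    # Derive the local filename from the last part of the URL.
    output_file = os.path.join(output_dir, url.split('/')[-1])
    response = urllib.request.urlopen(url)
    with open(output_file, 'wb') as writer:
        writer.write(response.read())
    return output_file

with open('test.txt', 'r') as f:
    urls = f.read().splitlines()

# A few worker threads are usually enough to saturate the connection.
with ThreadPoolExecutor(max_workers=5) as pool:
    for path in pool.map(download_one, urls):
        print('Downloaded', path)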

Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As @Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).

I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5MB), since the default value of 16 KB is really small and will just slow things down.
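Applied to the loop above, the streaming version would look roughly like this (the only changes are the shutil import and the copyfileobj call with the 5 MB buffer):

import os
import shutil
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with open(output_file, 'wb') as writer:
            # Copy in 5 MB chunks instead of reading the whole file into memory.
            shutil.copyfileobj(response, writer, 5*1024*1024)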

This works fine for me (note that fileName must be just the bare file name, for example 'afaf1.tif'):

import os
import urllib

def download(baseUrl, fileName, layer=0):
    print 'Trying to download file:', fileName
    url = baseUrl + fileName
    name = os.path.join('foldertodownload', fileName)
    try:
        # Note that the folder needs to exist
        urllib.urlretrieve(url, name)
    except:
        # Upon failure, retries up to 5 times in total
        print 'Download failed'
        print 'Could not download file:', fileName
        if layer > 4:
            return
        else:
            layer += 1
        print 'retrying', str(layer) + '/5'
        download(baseUrl, fileName, layer)
    print fileName + ' downloaded'

# url is the base URL (e.g. 'http://example.com/example/') and nameList
# holds the bare file names ('afaf1.tif', 'afaf2.tif', ...)
for fileName in nameList:
    download(url, fileName)

Moved unnecessary code out of the try block.
