Downloading a large number of files using Python
test.txt contains the list of files to be downloaded:
http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif
How can these files be downloaded using Python with maximum download speed?
My thinking was as follows:
import urllib.request

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
for line in lines:
    response = urllib.request.urlopen(line)
What comes after that? And how do I select the download directory?
Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the files for writing in binary mode (wb), since response.read() returns bytes, not str.
import os
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
for line in lines:
    response = urllib.request.urlopen(line)
    output_file = os.path.join(output_dir, line.split('/')[-1])
    with open(output_file, 'wb') as writer:
        writer.write(response.read())
Note:
Downloading multiple files can be faster if you use multiple threads, since a single download rarely uses the full bandwidth of your internet connection.
Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As @Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).
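The multithreading suggestion above can be sketched with the standard library's concurrent.futures. This is only a sketch under assumptions: the helper names target_path, download_one, and download_all are mine, not from the answer, and output_dir is assumed to exist.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def target_path(url, output_dir):
    # The last piece of the URL becomes the filename, as in the answer above.
    return os.path.join(output_dir, url.split('/')[-1])

def download_one(url, output_dir):
    output_file = target_path(url, output_dir)
    with urllib.request.urlopen(url) as response, open(output_file, 'wb') as writer:
        # Stream the body to disk instead of holding it all in memory.
        shutil.copyfileobj(response, writer)
    return output_file

def download_all(urls, output_dir, max_workers=8):
    # Each worker thread handles one URL at a time, up to max_workers in flight.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: download_one(u, output_dir), urls))
```

Threads help here because each download spends most of its time waiting on the network, so the GIL is not a bottleneck.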
I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024)  # (at least 5MB)
since the default value of 16 KB is really small and will just slow things down.
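Put together, the streamed version of the loop might look like this. A sketch only: download_streamed is my name for it, not from the answer, and output_dir is assumed to exist already.

```python
import os
import shutil
import urllib.request

def download_streamed(list_file, output_dir, chunk_size=5 * 1024 * 1024):
    # list_file holds one URL per line, like test.txt in the question.
    with open(list_file, 'r') as f:
        urls = f.read().splitlines()
    results = []
    for url in urls:
        output_file = os.path.join(output_dir, url.split('/')[-1])
        with urllib.request.urlopen(url) as response, open(output_file, 'wb') as writer:
            # Copy in chunk_size pieces (5 MB here) instead of the 16 KB default.
            shutil.copyfileobj(response, writer, chunk_size)
        results.append(output_file)
    return results
```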
This works fine for me (note that name must be absolute, for example 'afaf1.tif'):
import os
import urllib.request

def download(baseUrl, fileName, layer=0):
    print('Trying to download file:', fileName)
    url = baseUrl + fileName
    name = os.path.join('foldertodownload', fileName)
    try:
        # Note that the folder needs to exist
        urllib.request.urlretrieve(url, name)
    except OSError:
        # Upon failure to download, retries 5 times in total
        print('Download failed')
        print('Could not download file:', fileName)
        if layer > 4:
            return
        else:
            layer += 1
            print('retrying', str(layer) + '/5')
            download(baseUrl, fileName, layer)
    print(fileName + ' downloaded')

for fileName in nameList:
    download(url, fileName)
Moved unnecessary code out of the try block
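As a design note, the recursive retry above can also be written as a plain loop, which avoids deepening the call stack on repeated failures. A sketch under assumptions: download_with_retry, attempts, and the folder parameter are my names, and the folder must already exist.

```python
import os
import urllib.request

def download_with_retry(base_url, file_name, attempts=5, folder='foldertodownload'):
    url = base_url + file_name
    name = os.path.join(folder, file_name)  # folder must already exist
    for attempt in range(1, attempts + 1):
        try:
            # urlretrieve raises an OSError subclass (URLError/HTTPError) on failure.
            urllib.request.urlretrieve(url, name)
            print(file_name + ' downloaded')
            return True
        except OSError:
            print('Download failed, retrying %d/%d' % (attempt, attempts))
    print('Could not download file:', file_name)
    return False
```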