How to download multiple large files concurrently in python?

I am trying to download a series of WARC files from the CommonCrawl database, each about 25 MB. This is my script:

import json
import urllib.request
from urllib.error import HTTPError

from src.Util import rooted

with open(rooted('data/alexa.txt'), 'r') as alexa:
    for i, url in enumerate(alexa):
        if i % 1000 == 0:
            try:
                request = 'http://index.commoncrawl.org/CC-MAIN-2018-13-index?url={search}*&output=json' \
                    .format(search=url.rstrip())
                page = urllib.request.urlopen(request)
                for line in page:
                    result = json.loads(line)
                    urllib.request.urlretrieve('https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                               rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
            except HTTPError:
                pass

Currently, this requests the link to each WARC file via the CommonCrawl REST API and then starts the download into the 'data/warc' folder.

The problem is that each urllib.request.urlretrieve() call blocks until the file has been completely downloaded, so the next download request is not issued until then. Is there a way to make urllib.request.urlretrieve() return as soon as the download has been started and let the file download in the background, or to spin up a new thread for each request so that all the files download simultaneously?

Thanks

Use threads, futures even :)

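import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# `page` and `rooted` are the same objects as in the question's script.
jobs = []
with ThreadPoolExecutor(max_workers=100) as executor:
    for line in page:
        result = json.loads(line)
        # submit() queues the download and returns a Future immediately.
        future = executor.submit(urllib.request.urlretrieve,
                                 'https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                 rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
        jobs.append(future)
...
for f in jobs:
    # result() blocks until that download finishes and re-raises any error it hit.
    print(f.result())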

Read more here: https://docs.python.org/3/library/concurrent.futures.html
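
For completeness, here is a minimal sketch of the same idea wrapped in a reusable function that also records failures per file instead of stopping at the first one. The name `download_all`, its `fetch` parameter, and the default of 8 workers are assumptions made for illustration, not part of the question's code; `fetch` defaults to `urllib.request.urlretrieve` and is only a parameter so that the function can be tried without network access.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, fetch=urllib.request.urlretrieve, max_workers=8):
    """Download every (url, dest) pair in `jobs` concurrently.

    Hypothetical helper: returns (done, failed), where `done` lists the
    destinations that finished and `failed` maps each destination that
    raised to the exception it raised.
    """
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything up front; at most `max_workers` run at a time.
        futures = {executor.submit(fetch, url, dest): dest for url, dest in jobs}
        # as_completed yields each future as soon as it finishes, in any order.
        for future in as_completed(futures):
            dest = futures[future]
            try:
                future.result()  # re-raises whatever the worker raised
                done.append(dest)
            except Exception as exc:  # e.g. HTTPError, URLError, OSError
                failed[dest] = exc
    return done, failed
```

With the loop from the question, `jobs` would be a list of `(url, destination)` tuples built from each `result`, and the call would be `done, failed = download_all(jobs)`. A modest worker count is probably a safer starting point than 100, since each thread holds an open connection to the same host.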
