
Downloading Many Images with Python Requests and Multiprocessing

I'm attempting to download a few thousand images using Python with the multiprocessing and requests libraries. Things start off fine, but about 100 images in, everything locks up and I have to kill the processes. I'm using Python 2.7.6. Here's the code:

import re
import requests
import shutil
from multiprocessing import Pool
from urlparse import urlparse

# IMG_DST and list_of_image_urls are assumed to be defined elsewhere
# (the output directory and the path to the file listing the image URLs).

def get_domain_name(s):
    domain_name = urlparse(s).netloc
    new_s = re.sub(':', '_', domain_name)  # replace colons
    return new_s

def grab_image(url):
    response = requests.get(url, stream=True, timeout=2)
    if response.status_code == 200:
        img_name = get_domain_name(url)
        with open(IMG_DST + img_name + ".jpg", 'wb') as outf:
            shutil.copyfileobj(response.raw, outf)
        del response

def main():                                        
    with open(list_of_image_urls, 'r') as f:                 
        urls = f.read().splitlines()                                                             
    urls.sort()                                    
    pool = Pool(processes=4, maxtasksperchild=2)   
    pool.map(grab_image, urls)                     
    pool.close()                                   
    pool.join()

if __name__ == "__main__":
    main()

Edit: After changing the multiprocessing import to multiprocessing.dummy to use threads instead of processes, I was still experiencing the same problem. It turns out I'm sometimes hitting a Motion JPEG stream instead of a single image, which is what causes the lock-ups. To deal with this I'm now using a context manager and a custom FileTooBigException. While I haven't yet implemented a check that I've actually downloaded an image file, and some other housekeeping is missing, I thought the code below might be useful for someone:

import os
import socket
from contextlib import closing

import requests

# CHUNK_SIZE and LIMIT_SIZE are module-level constants defined elsewhere.


class FileTooBigException(requests.exceptions.RequestException):
    """File over LIMIT_SIZE"""


def grab_image(url):
    try:
        img = ''
        with closing(requests.get(url, stream=True, timeout=4)) as response:
            if response.status_code == 200:
                content_length = 0
                img_name = get_domain_name(url)
                img = IMG_DST + img_name + ".jpg"
                with open(img, 'wb') as outf:
                    for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                        outf.write(chunk)
                        content_length += CHUNK_SIZE
                        # Bail out if this is an endless stream (e.g. a
                        # Motion JPEG feed) rather than a single image.
                        if content_length > LIMIT_SIZE:
                            raise FileTooBigException(response)
    except requests.exceptions.Timeout:
        pass
    except requests.exceptions.ConnectionError:
        pass
    except socket.timeout:
        pass
    except FileTooBigException:
        # Remove the partially written file before giving up on this URL.
        os.remove(img)
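
One piece of housekeeping I still haven't added is verifying that the response is actually an image before writing it to disk. A minimal sketch of such a check, keying off the standard Content-Type header (the check_is_image helper name is just my own placeholder), might look like:

def check_is_image(response):
    # Best-effort filter: only treat the response as an image if the
    # server says so. Servers can lie or omit the header, so this is
    # a sanity check rather than a guarantee.
    content_type = response.headers.get('Content-Type', '')
    return content_type.startswith('image/')

It would be called right after the status-code check, skipping the URL when it returns False.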

Any suggested improvements are welcome!

There is no point in using multiprocessing for I/O concurrency. With network I/O, the thread involved spends most of its time waiting and doing nothing, and Python threads are excellent at doing nothing. So use a thread pool instead of a process pool. Each process consumes a lot of resources and is unnecessary for I/O-bound work, while threads share the process state and are exactly what you are looking for.
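
Since multiprocessing.dummy exposes the same Pool API backed by threads, the switch is essentially a one-line change. A minimal sketch, reusing grab_image and list_of_image_urls from the question:

from multiprocessing.dummy import Pool  # same Pool interface, backed by threads

def main():
    with open(list_of_image_urls, 'r') as f:
        urls = f.read().splitlines()
    # Threads are cheap for I/O-bound work, so a larger pool than the
    # original 4 processes is usually fine here.
    pool = Pool(16)
    pool.map(grab_image, urls)
    pool.close()
    pool.join()

The pool size is just an illustration; tune it to how many concurrent downloads the remote servers and your connection can handle.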
