
request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.

The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.

Here is the code I am using:

...
import urllib.request
import multiprocessing as mp

def getImages(val):

    # Download images
    try:
        url = ...    # preprocess the URL from the input val
        local = ...  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        # use val here: url may be unbound if the exception was raised
        # before url was assigned
        print("CAN'T DOWNLOAD - " + str(val))
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")

It often gets stuck halfway through the list: it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest. Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.

Thanks in advance

It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time. The multiprocessing module really launches separate instances of Python to get the work done in parallel.

But in your case, urllib is called in all of these instances: each of them tries to lock the I/O process. The one that succeeds (e.g. comes first) gets you the result, while the others (trying to lock an already locked process) fail.

This is a very simplified explanation, but here are some additional resources:

You can find another way to parallelize requests here: Multiprocessing useless with urllib2? (a thread-based sketch in that spirit follows the links below)

And more info about the GIL here: What is a global interpreter lock (GIL)?
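
The linked question broadly suggests using threads rather than extra processes for I/O-bound downloads. Below is a minimal thread-based sketch in that spirit, not code taken from the link; the 4-worker count and urls.txt come from the original script, while get_image and its filename choice are hypothetical:

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def get_image(url):
    # Hypothetical filename choice: the last path segment of the URL
    local = url.rsplit("/", 1)[-1]
    try:
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    # Threads release the GIL while blocked on network I/O, so a small
    # pool can overlap many downloads without extra processes
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(get_image, urls))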

OK, I have found an answer.

A possible culprit was that the script was getting stuck connecting to or downloading from a URL. So I added a socket timeout to limit the time allowed to connect and download an image.

And now, the issue no longer bothers me.

Here is my complete code:

...
import urllib.request
import multiprocessing as mp

import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):

    # Download images
    try:
        url = ...    # preprocess the URL from the input val
        local = ...  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        # use val here: url may be unbound if the exception was raised
        # before url was assigned
        print("CAN'T DOWNLOAD - " + str(val))
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")
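
One caveat: socket.setdefaulttimeout changes process-wide state, and the timeout bounds each blocking socket operation rather than the total download time. If you would rather keep the timeout local to a single request, urlopen accepts a per-call timeout argument. A minimal sketch, with download_image as a hypothetical helper that is not part of the script above:

import shutil
import urllib.request

def download_image(url, local, timeout=20):
    # The timeout (in seconds) applies to the connection attempt and to
    # each blocking read, without touching the process-wide default
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(local, "wb") as out:
            shutil.copyfileobj(response, out)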

Hope this solution helps others who are facing the same issue.
