
request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.

The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.

Here is the code that I am using:

import urllib.request
import multiprocessing as mp

def getImages(val):

    # Download one image
    try:
        url = ...    # preprocess the URL from the input val
        local = ...  # filename generated from global variables and random suffixes
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        # use val here: url may not be bound yet if the preprocessing failed
        print("CAN'T DOWNLOAD - " + str(val))
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")

It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.

Thanks in advance

It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time. The multiprocessing module actually launches separate instances of Python to get the work done in parallel.

But in your case, urllib is called in all of these instances: each of them tries to lock the IO process; the one that succeeds (e.g. comes first) gets you the result, while the others (trying to lock an already locked process) fail.

This is a very simplified explanation, but here are some additional resources:

You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
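For illustration, here is a minimal thread-based sketch of that alternative using only the standard library's concurrent.futures; the download helper and the (url, filename) pairs are hypothetical stand-ins for the question's getImages logic:

import concurrent.futures
import urllib.request

def download(url, local):
    # Fetch one URL to a local file; any failure propagates to the caller.
    urllib.request.urlretrieve(url, local)
    return url

# Hypothetical input: a list of (url, local_filename) pairs.
jobs = [("http://example.com/a.jpg", "a.jpg")]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(download, u, f): u for u, f in jobs}
    for fut in concurrent.futures.as_completed(futures):
        try:
            print("DONE - " + fut.result())
        except Exception:
            print("CAN'T DOWNLOAD - " + futures[fut])

Because the worker threads spend nearly all their time blocked on network IO, the GIL is released while they wait, so threads parallelize downloads about as well as processes here.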

And more info about the GIL here: What is a global interpreter lock (GIL)?

OK, I have found an answer.

A possible culprit was that the script was getting stuck connecting to or downloading from a URL. So I added a socket timeout to limit the time allowed for connecting and downloading an image.

And now, the issue no longer bothers me.

Here is my complete code:

import urllib.request
import multiprocessing as mp
import socket

# Set the default timeout in seconds for every new socket, so a stalled
# connection raises socket.timeout instead of hanging forever
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):

    # Download one image
    try:
        url = ...    # preprocess the URL from the input val
        local = ...  # filename generated from global variables and random suffixes
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        # use val here: url may not be bound yet if the preprocessing failed
        print("CAN'T DOWNLOAD - " + str(val))
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")
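Note that socket.setdefaulttimeout changes the timeout for the whole process. If you only want to bound individual downloads, a per-request timeout can be passed to urllib.request.urlopen instead; here is a minimal sketch (the fetch helper and dest filename are illustrative, not part of the original code):

import shutil
import urllib.request

def fetch(url, dest, timeout=20):
    # Open the URL with a per-request timeout and stream the body to disk.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)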

Hope this solution helps others who are facing the same issue.
