Running multiple threads in Python simultaneously - is it possible?

I'm writing a little crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously).

I've written a little piece of code that should do that:

import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError


def getPAGE(FetchAddress):
    # Try the fetch at most twice, sleeping 4 seconds after each failure.
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout = 8) #fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            # A generic exception has no .reason attribute, so print the exception itself.
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)

# range(1,50) starts 49 threads; the main thread must stay alive for them to finish.
for i in range(1,50):
    thread.start_new_thread(getPAGE, url)

From the Apache logs it doesn't seem like the threads are running simultaneously. There's a small gap between requests; it's almost undetectable, but I can see that the threads are not really parallel.

I've read about the GIL. Is there a way to bypass it without calling C/C++ code? I can't really understand how threading is possible with the GIL. Does Python basically interpret the next thread as soon as it finishes with the previous one?

Thanks.

As you point out, the GIL often prevents Python threads from running in parallel.

However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it will typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.
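The overlap of blocking waits can be observed directly. The following is a minimal sketch in Python 3 (an assumption; the snippets in this thread are Python 2), using `time.sleep` as a stand-in for a blocking network read, since both release the GIL while waiting. The names `fake_io` and `run_overlapped` are made up for illustration.

```python
import threading
import time

def fake_io(seconds):
    # time.sleep releases the GIL while blocked, much like a blocking socket read.
    time.sleep(seconds)

def run_overlapped(n_threads=4, seconds=0.2):
    """Run n_threads blocking waits concurrently and return the wall-clock time."""
    threads = [threading.Thread(target=fake_io, args=(seconds,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

elapsed = run_overlapped()
print("4 waits of 0.2 s took %.2f s in total" % elapsed)
```

If the waits were serialized, the total would be about 0.8 s; because they overlap, it should stay close to the length of a single wait.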

In general, however, multiprocessing is the safer bet when true parallelism is required.
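For CPU-bound work, a process pool might look like the sketch below. It is written for Python 3 (an assumption), and `count_primes` is a hypothetical workload chosen only because it keeps the interpreter busy; it is not part of the original question.

```python
from multiprocessing import Pool

def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def main():
    limits = [20000, 20000, 20000, 20000]
    # Each worker is a separate process with its own interpreter and its own GIL.
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(results)

if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__"` guard matters on platforms that start workers by spawning a fresh interpreter, because the worker re-imports this module.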

"I've read about GIL, is there a way to bypass it without calling C/C++ code?"

Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.

You can use an approach like this to create all threads, have them wait for a condition object, and then have them start fetching the URL "simultaneously":

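#!/usr/bin/env python
import threading
import datetime
import urllib2

allgo = threading.Condition()
go = False  # guarded by allgo; lets late threads see a notification they missed

class ThreadClass(threading.Thread):
    def run(self):
        # Block until the main thread sets the flag and notifies everyone.
        allgo.acquire()
        while not go:
            allgo.wait()
        allgo.release()
        print "%s at %s\n" % (self.getName(), datetime.datetime.now())
        url = urllib2.urlopen("http://www.ibm.com")

for i in range(50):
    t = ThreadClass()
    t.start()

# Release all waiting threads at once.
allgo.acquire()
go = True
allgo.notify_all()
allgo.release()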

This would get you a bit closer to having all fetches happen at the same time, but:

  • The network packets leaving your computer will pass along the Ethernet wire in sequence, not at the same time.
  • Even if you have 16+ cores on your machine, some router, bridge, modem or other equipment in between your machine and the web host is likely to have fewer cores, and may serialize your requests.
  • The web server you're fetching from will use an accept() call to respond to your request. For correct behavior, that is implemented using a server-global lock to ensure only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialization.

You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
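The same start-together idea can be sketched in Python 3 (an assumption; the code in this thread is Python 2) with `threading.Barrier`, which releases every waiter once the last thread has arrived, so it does not depend on all threads reaching the wait before a release is signalled. `fake_fetch` is a hypothetical placeholder for the real request, so the sketch runs without network access; the recorded timestamps show how closely the threads leave the barrier, not when requests reach the server.

```python
import threading
import time

N_THREADS = 8
barrier = threading.Barrier(N_THREADS)
start_times = []
lock = threading.Lock()

def fake_fetch():
    # Placeholder for a real request such as urllib.request.urlopen(url, timeout=8).
    time.sleep(0.05)

def worker(fetch):
    barrier.wait()                   # blocks until all N_THREADS have arrived
    started = time.perf_counter()
    with lock:                       # protect the shared list
        start_times.append(started)
    fetch()

threads = [threading.Thread(target=worker, args=(fake_fetch,))
           for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

spread = max(start_times) - min(start_times)
print("%d workers released within %.4f s of each other" % (len(start_times), spread))
```

Even with a small spread on the client, the serialization points listed above still apply once the requests leave the machine.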

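If you run your code with Jython or IronPython (and maybe PyPy in the future), it will run in parallel.

You can also look at things like the future of PyPy, where we will have software transactional memory (thus doing away with the GIL). This is all just research and intellectual musing at the moment, but it could grow into something big.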
