Running multiple threads in Python simultaneously: is it possible?
I'm writing a small crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously).
I've written a small piece of code that should do that:
    import thread
    import time
    from urllib2 import Request, urlopen, URLError, HTTPError

    def getPAGE(FetchAddress):
        attempts = 0
        while attempts < 2:
            req = Request(FetchAddress, None)
            try:
                response = urlopen(req, timeout=8)  # fetch the URL
                print "fetched url %s" % FetchAddress
            except HTTPError, e:
                print 'The server didn\'t do the request.'
                print 'Error code: ', str(e.code) + " address: " + FetchAddress
                time.sleep(4)
                attempts += 1
            except URLError, e:
                print 'Failed to reach the server.'
                print 'Reason: ', str(e.reason) + " address: " + FetchAddress
                time.sleep(4)
                attempts += 1
            except Exception, e:
                # A generic exception has no .reason attribute, so print it directly.
                print 'Something bad happened in getPAGE.'
                print 'Reason: ', str(e) + " address: " + FetchAddress
                time.sleep(4)
                attempts += 1
            else:
                try:
                    return response.read()
                except Exception:
                    print "there was an error with response.read()"
                    return None
        return None

    url = ("http://www.domain.com",)
    for i in range(1, 50):
        thread.start_new_thread(getPAGE, url)
From the Apache logs, it doesn't seem like the threads are running simultaneously. There's a small gap between requests; it's almost undetectable, but I can see that the threads are not really parallel.
I've read about the GIL. Is there a way to bypass it without calling C/C++ code? I can't really understand how threading is possible with the GIL. Does Python basically interpret the next thread as soon as it finishes with the previous one?
Thanks.
As you point out, the GIL often prevents Python threads from running in parallel.
However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it will typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.
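A small sketch of that effect, written for Python 3 rather than the Python 2 used in the question. Here time.sleep stands in for a blocking network call (an assumption, so the example runs offline; like blocking I/O, it releases the GIL while waiting), and the names fake_fetch and run_threads are made up for illustration:

```python
import threading
import time


def fake_fetch(delay, results, index):
    # time.sleep releases the GIL while it waits, much as a blocking socket read would.
    time.sleep(delay)
    results[index] = "done"


def run_threads(count, delay):
    """Start `count` threads that each block for `delay` seconds.

    Returns the list of results and the total elapsed wall-clock time.
    """
    results = [None] * count
    threads = [
        threading.Thread(target=fake_fetch, args=(delay, results, i))
        for i in range(count)
    ]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - started


results, elapsed = run_threads(5, 0.2)
print("5 waits of 0.2s took %.2fs in total" % elapsed)
```

If the waits were serialised, the total would be about one second. Because the threads overlap while they are blocked, it should come out close to 0.2 seconds.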
In general, however, multiprocessing is the safer bet when true parallelism is required.
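For CPU-bound work, a sketch along these lines (Python 3; count_primes and the chosen limits are illustrative and not from the original post) spreads the work over separate processes, each with its own interpreter and its own GIL:

```python
import multiprocessing


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    limits = [20000, 20000, 20000, 20000]
    # Each worker is a separate process, so the calls can run on separate cores.
    with multiprocessing.Pool(processes=4) as pool:
        counts = pool.map(count_primes, limits)
    print(counts)
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the script. For a crawler that mostly waits on the network, threads are usually enough; processes pay off when the work itself is computation.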
"I've read about the GIL. Is there a way to bypass it without calling C/C++ code?"
Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.
You can use an approach like this to create all the threads, have them wait on a condition object, and then have them start fetching the URL "simultaneously":
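    #!/usr/bin/env python
    import threading
    import datetime
    import urllib2

    allgo = threading.Condition()
    go = False  # becomes True once every thread has been started

    class ThreadClass(threading.Thread):
        def run(self):
            # Wait for the start signal. Checking the flag guards against a
            # notify that fires before this thread has begun waiting.
            allgo.acquire()
            while not go:
                allgo.wait()
            allgo.release()
            print "%s at %s\n" % (self.getName(), datetime.datetime.now())
            url = urllib2.urlopen("http://www.ibm.com")

    for i in range(50):
        t = ThreadClass()
        t.start()

    # Release all waiting threads at once.
    allgo.acquire()
    go = True
    allgo.notify_all()
    allgo.release()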
This would get you a bit closer to having all fetches happen at the same time, BUT:
The web server you are fetching content from will use an accept() call to respond to your request. For correct behavior, that is implemented using a server-global lock to ensure that only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation.
You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
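On Python 3, threading.Barrier is another way to line threads up before releasing them together. Note that this is a swapped-in technique, not the Condition-based code from the answer above, and that the fetch is replaced by recording a timestamp (an assumption, so the sketch runs without a network). The names release_together and THREAD_COUNT are made up for illustration. The measured spread illustrates the caveat: the threads leave the barrier close together, but not at the same instant.

```python
import threading
import time

THREAD_COUNT = 20  # illustrative; the answer above used 50


def release_together(count):
    """Hold `count` threads at a barrier, then record when each one gets going."""
    barrier = threading.Barrier(count)
    start_times = [None] * count

    def worker(index):
        barrier.wait()  # blocks until all `count` threads have arrived
        start_times[index] = time.perf_counter()
        # A real crawler would call its fetch function here.

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return start_times


start_times = release_together(THREAD_COUNT)
spread = max(start_times) - min(start_times)
print("threads started within %.6f seconds of each other" % spread)
```

Even before any request reaches the network or the server, the start times differ slightly, which is consistent with the small gaps seen in the Apache logs.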
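If you run your code with Jython or IronPython (and PyPy in the future), it will run in parallel.
You can also look at things like the future of PyPy, where we will have software transactional memory (thus doing away with the GIL). For now this is all just research and intellectual scoffing, but it could grow into something big.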