I'm writing a little crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously). I've written a little piece of code that should do that.
import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError

def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout=8)  # fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)
for i in range(1, 50):
    thread.start_new_thread(getPAGE, url)
From the Apache logs it doesn't seem like the threads are running simultaneously; there's a little gap between requests. It's almost undetectable, but I can see that the threads are not really parallel.
I've read about the GIL. Is there a way to bypass it without calling C/C++ code? I can't really understand how threading is possible at all with the GIL; does Python basically interpret the next thread as soon as it finishes with the previous one?
Thanks.
As you point out, the GIL often prevents Python threads from running in parallel.
However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it would typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.
In general, however, multiprocessing is the safer bet when true parallelism is required.
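To illustrate, here is a minimal Python 3 sketch of the multiprocessing approach (the count_primes function is a made-up stand-in for any CPU-bound work): each worker runs in a separate process with its own interpreter and its own GIL, so the workers really do run in parallel on multiple cores.

```python
from multiprocessing import Pool

def count_primes(limit):
    # CPU-bound work that threads could NOT run in parallel under the GIL
    primes = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            primes += 1
    return primes

if __name__ == "__main__":
    # Four worker processes, each with its own GIL, crunching in parallel
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, [10000] * 4)
    print(results)
```

The same Pool.map pattern works for a crawler: map a fetch function over a list of URLs, and each request runs in its own process.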
I've read about GIL, is there a way to bypass it with out calling a C\\C++ code?
Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.
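You can observe the I/O exception directly. In this small sketch (Python 3), two threads each block in time.sleep, which releases the GIL while waiting, so the total wall time is roughly one second rather than two:

```python
import threading
import time

def blocking_io():
    time.sleep(1)  # stands in for a blocking request; releases the GIL while waiting

start = time.time()
threads = [threading.Thread(target=blocking_io) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("elapsed: %.2fs" % elapsed)  # roughly 1s, not 2s
```

A blocking urlopen call behaves the same way, which is why threaded crawlers still get useful concurrency despite the GIL.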
You can use an approach like this to create all threads, have them wait on a condition object, and then have them start fetching the URL "simultaneously":
#!/usr/bin/env python
import threading
import datetime
import time
import urllib2

allgo = threading.Condition()

class ThreadClass(threading.Thread):
    def run(self):
        allgo.acquire()
        allgo.wait()
        allgo.release()
        print "%s at %s\n" % (self.getName(), datetime.datetime.now())
        url = urllib2.urlopen("http://www.ibm.com")

for i in range(50):
    t = ThreadClass()
    t.start()

time.sleep(1)  # give every thread a chance to reach wait(); a thread that
               # calls wait() after notify_all() would block forever
allgo.acquire()
allgo.notify_all()
allgo.release()
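Note that a Condition has a lost-wakeup hazard: a thread that reaches wait() after notify_all() has fired blocks forever. A threading.Event sidesteps this, because late arrivals see the flag already set and proceed immediately. Here is a sketch (Python 3, with made-up worker/hits names):

```python
import threading

allgo = threading.Event()
hits = []

def worker():
    allgo.wait()  # blocks until the event is set; returns at once if already set
    hits.append(threading.current_thread().name)

threads = [threading.Thread(target=worker) for _ in range(50)]
for t in threads:
    t.start()
allgo.set()       # release every waiter at once, including late starters
for t in threads:
    t.join()
print(len(hits))  # 50
```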
This would get you a bit closer to having all fetches happen at the same time, BUT: the web server will be sitting in an accept() call, waiting to respond to your requests. For correct behavior, that is implemented using a server-global lock to ensure only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation. You will probably get your requests to overlap to a greater degree (i.e. others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
If you run your code under Jython or IronPython (and, in the future, PyPy), it will run in parallel.
You can also keep an eye on PyPy's future: with software transactional memory it could eventually abolish the GIL. That is all still research and intellectual speculation for now, but it may grow into something big.