
Running multiple threads in Python simultaneously: is it possible?

I'm writing a small crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously).

I've written a little piece of code that should do that.

import thread
import time  # needed for the time.sleep() calls in getPAGE
from urllib2 import Request, urlopen, URLError, HTTPError


def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout = 8) #fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            # A generic exception may not have a .reason attribute, so print the exception itself.
            print 'Reason: ', str(e) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)

for i in range(1,50):
    thread.start_new_thread(getPAGE, url)

From the Apache logs, it doesn't seem like the threads are running simultaneously. There's a small gap between requests; it's almost undetectable, but I can see that the threads are not really parallel.

I've read about the GIL. Is there a way to bypass it without calling C/C++ code? I can't really understand how threading is possible with the GIL. Does Python basically interpret the next thread as soon as it finishes with the previous one?

Thanks.

As you point out, the GIL often prevents Python threads from running in parallel.

However, that's not always the case. One exception is I/O-bound code. When a thread is waiting for an I/O request to complete, it would typically have released the GIL before entering the wait. This means that other threads can make progress in the meantime.
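To make the I/O-bound case concrete, here is a minimal sketch in Python 3 (the snippets in this thread are Python 2). It assumes time.sleep is an acceptable stand-in for a blocking network call, so that it runs without network access; like blocking socket I/O, time.sleep releases the GIL while it waits. The names fake_fetch and run_in_threads are hypothetical.

```python
import threading
import time


def fake_fetch(delay, results, index):
    """Stand-in for a blocking network call; time.sleep releases the GIL."""
    time.sleep(delay)
    results[index] = "done"


def run_in_threads(count, delay):
    """Start `count` threads that each block for `delay` seconds.

    Returns (results, elapsed_seconds).
    """
    results = [None] * count
    threads = [
        threading.Thread(target=fake_fetch, args=(delay, results, i))
        for i in range(count)
    ]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.monotonic() - start


if __name__ == "__main__":
    results, elapsed = run_in_threads(10, 0.2)
    # If the waits overlap, elapsed stays close to 0.2 s instead of 10 * 0.2 s.
    print("finished %d waits in %.2f s" % (len(results), elapsed))
```

If the threads were fully serialized, the total time would be roughly the sum of the individual waits; because the waits overlap, it should stay close to a single wait.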

In general, however, multiprocessing is the safer bet when true parallelism is required.
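As a rough illustration of the multiprocessing route, the sketch below (Python 3) spreads a deliberately CPU-bound function over a pool of worker processes. The function count_primes, the pool size, and the workload are all hypothetical choices made for the example; each worker is a separate process with its own interpreter and its own GIL.

```python
import multiprocessing


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def main():
    limits = [20000, 20000, 20000, 20000]
    # Each task runs in its own process, so the tasks do not share one GIL.
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(count_primes, limits))


if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__"` guard matters here: on platforms that start workers by re-importing the main module, it keeps the pool from being created again inside each worker.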

I've read about the GIL. Is there a way to bypass it without calling C/C++ code?

Not really. Functions called through ctypes will release the GIL for the duration of those calls. Functions that perform blocking I/O will release it too. There are other similar situations, but they always involve code outside the main Python interpreter loop. You can't let go of the GIL in your Python code.

You can use an approach like this to create all threads, have them wait for a condition object, and then have them start fetching the URL "simultaneously":

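#!/usr/bin/env python
import threading
import datetime
import urllib2

allgo = threading.Condition()
started = False  # protected by allgo; set to True once all threads may proceed

class ThreadClass(threading.Thread):
    def run(self):
        allgo.acquire()
        # Check the flag so a thread that arrives after notify_all() does not wait forever.
        while not started:
            allgo.wait()
        allgo.release()
        print "%s at %s\n" % (self.getName(), datetime.datetime.now())
        url = urllib2.urlopen("http://www.ibm.com")

for i in range(50):
    t = ThreadClass()
    t.start()

allgo.acquire()
started = True
allgo.notify_all()  # wake every thread that is currently waiting
allgo.release()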

This would get you a bit closer to having all fetches happen at the same time, but:

  • The network packets leaving your computer will pass along the Ethernet wire in sequence, not at the same time.
  • Even if you have 16+ cores on your machine, some router, bridge, modem, or other equipment between your machine and the web host is likely to have fewer cores and may serialize your requests.
  • The web server you're fetching from will use an accept() call to respond to your request. For correct behavior, that is implemented using a server-global lock to ensure that only one server process/thread responds to your query. Even if some of your requests arrive at the server simultaneously, this will cause some serialisation.

You will probably get your requests to overlap to a greater degree (i.e., others starting before some finish), but you're never going to get all of your requests to start simultaneously on the server.
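One way to see how close to "simultaneous" the starts actually get is to timestamp each thread right after it is released. The sketch below is written for Python 3 and uses threading.Barrier, which is a swapped-in alternative to a condition object, not what the code above uses. The helper name start_together is hypothetical, and time.sleep stands in for the real fetch so the example runs without network access.

```python
import threading
import time


def start_together(count, work):
    """Run `work(i)` in `count` threads that are all released by one barrier.

    Returns the time.monotonic() value each thread recorded just after release.
    """
    barrier = threading.Barrier(count)
    started_at = [None] * count

    def runner(i):
        barrier.wait()  # blocks until all `count` threads have arrived
        started_at[i] = time.monotonic()
        work(i)

    threads = [threading.Thread(target=runner, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return started_at


if __name__ == "__main__":
    # time.sleep is a placeholder for a real fetch such as urlopen().
    stamps = start_together(20, lambda i: time.sleep(0.1))
    print("spread between first and last start: %.6f s" % (max(stamps) - min(stamps)))
```

The printed spread is typically small but not zero, which matches the point above: the threads are released together, yet they still begin their work one after another.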

If you run your code with Jython or IronPython (and, in the future, PyPy), it will run in parallel.

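You can also look at things like the future of PyPy, where we will have software transactional memory (thus doing away with the GIL). This is all just research and intellectual scoffing at the moment, but it could grow into something big.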

