
Python Thread Communication Solution

I'm writing a very basic multi-threaded web crawler in Python, using a while loop in the function that crawls a page and extracts URLs, as follows:

import Queue  # Python 2 module name; in Python 3 it is "queue"

def crawl():
    while True:
        try:
            # Block for up to 10 seconds waiting for a URL from the pool.
            p = Page(pool.get(True, 10))
        except Queue.Empty:
            continue

        # then extract URLs from the page and put new URLs into the queue

(The complete source code is here in another question: Multi-threaded Python Web Crawler Got Stuck )

Now, ideally, I want to add a condition to the while loop so that it exits when:

  1. the pool (the Queue object that stores the URLs) is empty, and

  2. all threads are blocked waiting to get a URL from the queue (which means no thread is putting new URLs into the pool, so keeping them waiting makes no sense and will leave my program stuck).

For example, something like:

# thread_1.attr == 1 means thread-1 is blocking; 0 means not blocking

while not (pool.empty() and (thread_1.attr == 1 and thread_2.attr == 1 and ...)):
    # do the crawl stuff

So I'm wondering whether there is a way for a thread to check what the other active threads are doing, or to check the status or the value of an attribute of another active thread.

I have read the official documentation on threading.Event(), but still can't figure it out.
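For reference, the exit condition described here maps closely onto Queue.join() together with a threading.Event: join() returns only once the queue is empty and every URL taken from it has been marked done, which is exactly "empty queue and no worker still producing". A minimal Python 3 sketch, where handle() is a hypothetical placeholder for the real page-fetching code:

```python
import queue
import threading

NUM_THREADS = 4
pool = queue.Queue()          # the URL queue, matching the name in the question
done = threading.Event()      # set once there is no more work anywhere

def crawl():
    while not done.is_set():
        try:
            url = pool.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            handle(url)       # hypothetical: fetch the page, pool.put() new URLs
        finally:
            pool.task_done()  # this URL is fully processed

def handle(url):
    pass                      # placeholder for the real crawling logic

threads = [threading.Thread(target=crawl) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

pool.put("http://example.com/")   # hypothetical seed URL

# join() returns only when the queue is empty AND every get() has been
# matched by a task_done(), i.e. no worker can still produce new URLs.
pool.join()
done.set()                        # tell the workers to exit their loops
for t in threads:
    t.join()
```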

Hope someone here can point me the way :)

Thank you very much!

Marcus

You can try to implement what you want from scratch; a couple of solutions come to mind right now:

  • Use threading.enumerate() to check which threads are still alive.
  • Implement a thread pool that lets you know which threads are still alive and which have been returned to the pool; this also has the benefit of limiting the number of threads that hit a third-party website (check here for an example).
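The first suggestion can be sketched quickly: threading.enumerate() returns every Thread object that is still alive, including the main thread, so filter that out to see only the workers (worker() here is a hypothetical stand-in for a crawler thread):

```python
import threading
import time

def worker():
    time.sleep(0.2)           # hypothetical stand-in for crawling work

t = threading.Thread(target=worker)
t.start()

# threading.enumerate() lists every Thread that is still alive, including
# the main thread, so filter the main thread out to count only workers.
workers = [th for th in threading.enumerate()
           if th is not threading.main_thread()]

t.join()
# Once a thread's run() method returns, it drops out of enumerate().
workers_after = [th for th in threading.enumerate()
                 if th is not threading.main_thread()]
```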

If you don't want to reinvent the wheel, you can use an existing library that implements a thread pool, or check out gevent, which uses green threads and also offers a thread pool. I have implemented something similar to this using code like:

while 1:
    try:
        url = queue.get_nowait()
    except Empty:
        # If every worker has been returned to the pool, no one can
        # add new URLs anymore, so the crawl is done.
        if pool.free_count() == pool.size:
            break
        continue
    ...

You can also write a sentinel object to your queue that marks the end of crawling, exit your main loop when you see it, and then wait for the threads to finish (using a pool, for example):

while 1:
    try:
        url = queue.get_nowait()
        # StopIteration marks that no URL will be added to the queue anymore.
        if url is StopIteration:
            break
    except Empty:
        continue
    ...
pool.join()
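The producer side of this sentinel pattern might look like the following sketch (the URLs and the worker body are hypothetical). Each worker re-puts the sentinel before exiting so that all workers eventually see it; this approach fits best when some coordinator knows when no more URLs will be added:

```python
import queue
import threading

q = queue.Queue()
NUM_WORKERS = 3
crawled = []                    # stands in for real crawling results
crawled_lock = threading.Lock()

def worker():
    while True:
        url = q.get()
        if url is StopIteration:
            # Re-put the sentinel so the remaining workers see it too,
            # then exit this worker.
            q.put(StopIteration)
            break
        with crawled_lock:
            crawled.append(url)  # hypothetical crawling work

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for u in ("http://a.example/", "http://b.example/"):
    q.put(u)
q.put(StopIteration)            # no more URLs will ever be added

for t in threads:
    t.join()
```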

You can choose whichever you prefer; hopefully this was helpful.

Consider looking at this solution: Web crawler Using Twisted. As the answer to that question says, I would also recommend that you look at http://scrapy.org/

Multi-threading (using threads directly) in Python is nasty, so I would avoid it and use some form of message passing or reactor-based programming instead.
