I'm writing a very basic multi-threaded web crawler in Python. The function that crawls a page and extracts URLs uses a while loop, as follows:
def crawl():
    while True:
        try:
            p = Page(pool.get(True, 10))
        except Queue.Empty:
            continue
        # then extract urls from the page and put new urls into the queue
(The complete source code is in another question: Multi-threaded Python Web Crawler Got Stuck.)
Now ideally, I want to add a condition to the while loop so that it exits when:
the pool (the Queue object that stores the URLs) is empty, and
all threads are blocked, waiting to get a URL from the queue (which means no thread is putting new URLs into the pool, so keeping them waiting is pointless and will leave my program stuck).
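For what it's worth, the standard library already encodes both of these conditions: queue.Queue.join() returns exactly when the queue is empty and every item taken out has been marked finished with task_done(). A minimal sketch of a crawler built on that idea (the extract function is a made-up stand-in for real fetching and parsing):

```python
import queue
import threading

pool = queue.Queue()
visited = set()
visited_lock = threading.Lock()

def extract(url):
    # Made-up stand-in for fetching a page and parsing its links:
    # every url "links" to one deeper url, three segments at most.
    if url.count("/") < 3:
        return [url + "/x"]
    return []

def crawl():
    while True:
        url = pool.get()          # block until a url is available
        try:
            with visited_lock:
                if url in visited:
                    continue      # the finally clause still runs task_done()
                visited.add(url)
            for link in extract(url):
                pool.put(link)
        finally:
            pool.task_done()      # mark this queue item as fully processed

for seed in ["/a", "/b"]:
    pool.put(seed)
for _ in range(4):
    threading.Thread(target=crawl, daemon=True).start()

# join() returns exactly when the queue is empty AND every item taken
# out has been task_done()-ed -- i.e. the two exit conditions above.
pool.join()
```

The workers are daemon threads, so after join() returns they can stay blocked in get() and simply die with the process; no per-thread status flags are needed.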
For example, something like:
# thread-1.attr == 1 means thread-1 is blocking; 0 means not blocking
while not (pool.empty() and (thread-1.attr == 1 and thread-2.attr == 1 and ...)):
    # do the crawl stuff
So I'm wondering if there is a way for a thread to check what other active threads are doing, or to check the status or the value of an attribute of another active thread.
I have read the official documentation on threading.Event(), but still can't figure it out.
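(For reference, threading.Event is essentially a thread-safe boolean flag: one thread calls set(), and other threads poll is_set() or block in wait(). A toy sketch, with made-up names, of a worker loop controlled by an Event:)

```python
import queue
import threading

stop = threading.Event()       # shared "please exit" flag
jobs = queue.Queue()
results = queue.Queue()

def worker():
    # Loop until some other thread calls stop.set().
    while not stop.is_set():
        try:
            item = jobs.get(timeout=0.1)   # time out so the flag is re-checked
        except queue.Empty:
            continue
        results.put(item * 2)              # made-up "work"

for i in range(5):
    jobs.put(i)
t = threading.Thread(target=worker)
t.start()

collected = [results.get() for _ in range(5)]  # wait until all jobs are done
stop.set()                                     # now tell the worker to exit
t.join()
```

The timeout on get() matters: without it the worker could block forever inside get() and never re-check the flag.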
Hope someone here can point me the way :)
Thank you very much!
Marcus
You can try to implement what you want from scratch; there are different solutions that come to my mind right now.
If you don't want to reinvent the wheel, you can use an existing library that implements a thread pool, or check out gevent, which uses green threads and also offers a thread pool. I have implemented something similar using code like:
while 1:
    try:
        url = queue.get_nowait()
    except Empty:
        # Check that all threads are done.
        if pool.free_count() == pool.size:
            break
    ...
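With the standard library alone (no gevent), the same idea can be approximated by counting busy workers under a lock: a worker exits once the queue is empty and no sibling is still holding an item. A sketch with a made-up process step that spawns follow-up items (note that process enqueues its follow-ups before the worker drops its busy flag, which is what makes the exit check safe):

```python
import queue
import threading
import time

NUM_WORKERS = 3
q = queue.Queue()
lock = threading.Lock()
busy = 0                 # workers currently processing an item
out = []

def process(item):
    # Made-up work: record the item; items below 3 spawn one follow-up.
    with lock:
        out.append(item)
    if item < 3:
        q.put(item + 1)  # enqueue follow-ups BEFORE busy is decremented

def worker():
    global busy
    while True:
        with lock:                  # serialize the get with the busy check
            try:
                item = q.get_nowait()
            except queue.Empty:
                if busy == 0:
                    return          # queue empty and every worker idle: done
                item = None         # a busy sibling may still enqueue items
            else:
                busy += 1
        if item is None:
            time.sleep(0.01)        # back off, then re-check
            continue
        process(item)
        with lock:
            busy -= 1

q.put(0)
q.put(5)
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```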
You can also put a sentinel object into your queue that marks the end of crawling, then exit your main loop and wait for the threads to finish (using a pool, for example):
while 1:
    try:
        url = queue.get_nowait()
        # StopIteration marks that no more urls will be added to the queue.
        if url is StopIteration:
            break
    except Empty:
        continue
    ...
pool.join()
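Filling in the producer side, a complete sentinel-based shutdown could look like the sketch below (the worker body is a placeholder for real crawling; NUM_WORKERS and seen are made-up names):

```python
import queue
import threading

NUM_WORKERS = 3
urls = queue.Queue()
seen = []
seen_lock = threading.Lock()

def worker():
    while True:
        url = urls.get()
        if url is StopIteration:       # sentinel: no more urls are coming
            urls.put(StopIteration)    # put it back so sibling workers see it too
            return
        with seen_lock:
            seen.append(url)           # stand-in for "fetch and parse"

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for u in ["a", "b", "c", "d"]:
    urls.put(u)
urls.put(StopIteration)    # producer signals the end once all urls are queued

for t in threads:
    t.join()
```

Note the caveat: this works when a single producer knows when the work is finished. In a crawler where the workers themselves feed the queue, nobody knows up front when to enqueue the sentinel, so you still need an idle-detection scheme like the free_count check above.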
You can choose whichever you prefer; hopefully this was helpful.
Consider looking at this solution: Web crawler Using Twisted. As the answer to that question says, I would also recommend you look at http://scrapy.org/
Multi-threading (using threads directly) in Python is nasty, so I would avoid it and use some sort of message passing or reactor-based programming instead.