Too many threads in Python threading - Recursive traversal

Question

I have a script to traverse an AWS S3 bucket to do some aggregation at the file level.

from threading import Semaphore, Thread

class Spider:
    def __init__(self):
        self.sem = Semaphore(120)
        self.threads = list()

    def crawl(self, root_url):
        self.recursive_harvest_subroutine(root_url)
        # Threads started later are appended while this loop runs, so it joins them too.
        for thread in self.threads:
            thread.join()

    def recursive_harvest_subroutine(self, url):
        children = get_direct_subdirs(url)
        self.sem.acquire()
        try:
            if len(children) == 0:
                queue_url_to_do_something_later(url)  # Done
            else:
                for child_url in children:
                    # One thread is started per subdirectory. The semaphore only limits
                    # how many threads run this section at once, not how many exist.
                    thread = Thread(target=self.recursive_harvest_subroutine, args=(child_url,))
                    self.threads.append(thread)
                    thread.start()
        finally:
            self.sem.release()

This used to run fine until I encountered a bucket holding several TB of data with hundreds of thousands of subdirectories. The number of Thread objects in self.threads grows very quickly, and soon the script fails with

RuntimeError: can't start new thread

There is some extra processing I have to do in the script, so I can't just get all files from the bucket.

Currently I only let the script go parallel once the traversal is at least 2 levels deep, but that is just a workaround. Any suggestions are appreciated.

Answer 1

The original code traversed the tree breadth-first (BFS), which created a large number of threads that sat waiting on the semaphore. I changed it to depth-first (DFS) and everything works fine. A sketch in case someone needs this in the future:

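    from threading import Lock, Semaphore, Thread
    import time

    class Spider:
        def __init__(self):
            self.sem = Semaphore(120)  # at most 120 worker threads alive at once
            self.urls = list()         # stack of directories still to visit
            self.mutex = Lock()        # guards self.urls and self.active
            self.active = 0            # worker threads currently running

        def is_done(self):
            # One possible check: nothing queued and no worker still running.
            with self.mutex:
                return not self.urls and self.active == 0

        def crawl(self, root_url):
            self.recursive_harvest_subroutine(root_url)
            while not self.is_done():
                self.sem.acquire()  # blocks while 120 workers are running
                with self.mutex:
                    url = self.urls.pop(0) if self.urls else None
                    if url is not None:
                        self.active += 1
                if url is None:
                    self.sem.release()  # nothing queued yet; workers may still add urls
                    time.sleep(0.01)
                    continue
                Thread(target=self.worker, args=(url,)).start()

        def worker(self, url):
            try:
                self.recursive_harvest_subroutine(url)
            finally:
                with self.mutex:
                    self.active -= 1
                self.sem.release()  # free the slot only once the work is finished

        def recursive_harvest_subroutine(self, url):
            children = get_direct_subdirs(url)
            if len(children) == 0:
                queue_url_to_do_something_later(url)  # Done
            else:
                with self.mutex:
                    for child_url in children:
                        self.urls.insert(0, child_url)  # newest first, so pop(0) goes depth-first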
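
As a rough illustration of why the traversal order matters, here is a toy model (not the S3 traversal itself). The function max_pending and its parameters are made up for this example: it walks a complete tree one node at a time and reports the largest backlog of nodes waiting to be visited. In the threaded crawler, that backlog corresponds to waiting threads or queued URLs.

```python
from collections import deque


def max_pending(branching, depth, order):
    """Walk a complete tree one node at a time and return the largest
    number of nodes that were ever waiting to be visited."""
    pending = deque([0])  # each entry is the depth of a node waiting to be visited
    peak = 1
    while pending:
        # BFS takes the oldest entry, DFS takes the newest one.
        level = pending.popleft() if order == "bfs" else pending.pop()
        if level < depth:
            pending.extend([level + 1] * branching)
        peak = max(peak, len(pending))
    return peak


print(max_pending(10, 4, "bfs"))  # 10000
print(max_pending(10, 4, "dfs"))  # 37
```

With 10 subdirectories per level and 4 levels, the breadth-first backlog grows to the size of the widest level, while the depth-first backlog stays proportional to the depth.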

There is no join() on the threads here, so I implemented my own is_done() check to detect when the traversal has finished.
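
A different way to get a completion check without hand-rolling it, sketched here as an alternative and not as what the code above does: a fixed pool of worker threads pulling from a standard-library queue.LifoQueue (LIFO order keeps the traversal depth-first), with Queue.join() signalling when every queued URL has been processed. The crawl function, its get_direct_subdirs and handle_leaf callbacks, the num_workers parameter, and the in-memory tree are all assumptions for illustration.

```python
import queue
import threading


def crawl(root_url, get_direct_subdirs, handle_leaf, num_workers=8):
    """Visit every directory under root_url with a fixed number of threads.
    Returns a list of (url, exception) pairs for directories that failed."""
    pending = queue.LifoQueue()  # LIFO order gives a depth-first traversal
    pending.put(root_url)
    errors = []

    def worker():
        while True:
            url = pending.get()
            try:
                if url is None:  # sentinel: no more work for this worker
                    return
                children = get_direct_subdirs(url)
                if children:
                    for child_url in children:
                        pending.put(child_url)
                else:
                    handle_leaf(url)
            except Exception as exc:
                errors.append((url, exc))  # keep the worker alive, report later
            finally:
                pending.task_done()  # runs for every get(), including the sentinel

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.join()  # returns once every queued url has been processed
    for _ in threads:
        pending.put(None)  # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    return errors


# Hypothetical in-memory directory tree standing in for the bucket listing.
tree = {"root/": ["root/a/", "root/b/"], "root/a/": ["root/a/x/"]}
leaves = []
failed = crawl("root/", lambda url: tree.get(url, []), leaves.append, num_workers=4)
print(sorted(leaves))  # ['root/a/x/', 'root/b/']
```

The number of threads never exceeds num_workers, no matter how many subdirectories the bucket has, because directories are queued as data instead of each getting a thread of its own.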

solution1
0 2019-08-06 18:24:03
