I am trying to make a web spider that can submit multiple requests in parallel using ThreadPoolExecutor. If it's just one level, the problem is simple enough, but I want to exhaustively harvest a directory tree, which makes the problem recursive. My program runs normally if I don't use multiple threads, but things go wrong when I try to use ThreadPoolExecutor. Below is the code:
    from concurrent.futures import ThreadPoolExecutor

    class Spider:
        def __init__(self):
            self.executor = ThreadPoolExecutor(max_workers=20)

        def crawl(self, root_url):
            self.recursive_harvest_subroutine(root_url)
            self.executor.shutdown()

        def recursive_harvest_subroutine(self, url):
            children = get_direct_subdirs(url)
            if len(children) == 0:
                queue_url_to_do_something_later(url)  # Done
            else:
                for child_url in children:
                    self.executor.submit(self.recursive_harvest_subroutine, child_url)
Then I call the spider with:

    Spider().crawl(some_url)
The spider only crawls the first level (the direct children of some_url) but not level 2+ directories.
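As far as I can tell, the cause is that crawl() calls shutdown() as soon as the level-0 call returns, and an executor refuses any submit() made after shutdown() has started, even from tasks that are still running. The resulting RuntimeError is captured inside the child futures, so nothing is ever printed. A minimal reproduction (the names here are placeholders, not my real crawling code):

```python
from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor(max_workers=2)

def task():
    time.sleep(0.1)  # let the main thread reach shutdown() first
    # This nested submit runs after shutdown() has set the shutdown flag,
    # so it raises RuntimeError inside the worker thread.
    return executor.submit(lambda: "child")

future = executor.submit(task)
executor.shutdown()  # waits for task() to finish, but rejects new submissions

# The error is swallowed into the future instead of being raised loudly:
print(future.exception())  # cannot schedule new futures after shutdown
```

This is exactly the shape of my spider: the level-1 tasks are the ones that hit the shutdown flag when they try to submit level 2.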
If I instead create a new ThreadPoolExecutor for each level, the spider crawls correctly, but at the cost of an explosive number of threads, which soon crashes my computer.
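For the record, one way to keep a single ThreadPoolExecutor is to drive the recursion from the main thread: only the blocking directory listing goes to the pool, new listings are submitted as results come back, and shutdown happens only once nothing is pending. A sketch, where TREE and get_direct_subdirs are hypothetical stand-ins for my real listing code:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical directory tree standing in for the real get_direct_subdirs.
TREE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": [], "/a/1": []}

def get_direct_subdirs(url):
    return TREE.get(url, [])

def crawl(root_url, max_workers=20):
    leaves = []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # map: in-flight future -> the url it is listing
        pending = {ex.submit(get_direct_subdirs, root_url): root_url}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                url = pending.pop(fut)
                children = fut.result()  # re-raises any worker exception
                if not children:
                    leaves.append(url)   # leaf: queue for later processing
                else:
                    for child in children:
                        pending[ex.submit(get_direct_subdirs, child)] = child
    # the with-block calls shutdown() only after pending is empty
    return leaves
```

Because no worker ever submits or waits on another future, the pool can never deadlock or hit the shutdown flag, and the thread count stays capped at max_workers.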
Okay, so I thought the ThreadPoolExecutor would act as a semaphore and simply limit the number of concurrent threads. It does not. I rewrote the whole thing using a real semaphore from threading, and now it works again.
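Here is roughly the shape of the semaphore version (a sketch, again with a hypothetical TREE standing in for get_direct_subdirs). Note the caveat: the semaphore caps how many threads do work at once, but it does not cap how many thread objects get created.

```python
import threading

MAX_CONCURRENCY = 20
gate = threading.BoundedSemaphore(MAX_CONCURRENCY)

# Hypothetical directory tree standing in for the real get_direct_subdirs.
TREE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": [], "/a/1": []}

def get_direct_subdirs(url):
    return TREE.get(url, [])

leaves = []
leaves_lock = threading.Lock()
threads = []

def harvest(url):
    with gate:  # at most MAX_CONCURRENCY listings run at once
        children = get_direct_subdirs(url)
    # the gate is released before spawning children, so no deadlock
    if not children:
        with leaves_lock:
            leaves.append(url)  # leaf: queue for later processing
        return
    for child in children:
        t = threading.Thread(target=harvest, args=(child,))
        threads.append(t)       # append before start, so the join loop sees it
        t.start()

def crawl(root_url):
    harvest(root_url)
    i = 0
    while i < len(threads):     # the list may grow while we join
        threads[i].join()
        i += 1
    return leaves
```

A finished thread has already appended all of its children to the list, so joining the list in order is guaranteed to catch every thread, including ones created mid-loop.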