I am trying to make a web spider that can submit multiple requests in parallel using ThreadPoolExecutor. If it's just one level, the problem is simple enough, but I want to exhaustively harvest a directory tree, which makes the problem recursive. My program runs normally if I don't use multiple threads, but things go wrong when I try to use ThreadPoolExecutor. Below is the code:
    from concurrent.futures import ThreadPoolExecutor

    class Spider:
        def __init__(self):
            self.executor = ThreadPoolExecutor(max_workers=20)

        def crawl(self, root_url):
            self.recursive_harvest_subroutine(root_url)
            self.executor.shutdown()

        def recursive_harvest_subroutine(self, url):
            children = get_direct_subdirs(url)
            if len(children) == 0:
                queue_url_to_do_something_later(url)  # Done
            else:
                for child_url in children:
                    self.executor.submit(self.recursive_harvest_subroutine, child_url)
Then I call the spider with:

    Spider().crawl(some_url)
The spider only crawls the first level (the direct children of some_url) but not level 2+ directories.
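As far as I can tell, the cause is that crawl() calls shutdown() as soon as the level-0 call returns, and an executor refuses any submit() made after shutdown() has started, even from tasks that are still running. The resulting RuntimeError is captured inside the child futures, so nothing is ever printed. A minimal reproduction (the names here are placeholders, not my real crawling code):

```python
from concurrent.futures import ThreadPoolExecutor
import time

executor = ThreadPoolExecutor(max_workers=2)

def task():
    time.sleep(0.1)  # let the main thread reach shutdown() first
    # This nested submit runs after shutdown() has set the shutdown flag,
    # so it raises RuntimeError inside the worker thread.
    return executor.submit(lambda: "child")

future = executor.submit(task)
executor.shutdown()  # waits for task() to finish, but rejects new submissions

# The error is swallowed into the future instead of being raised loudly:
print(future.exception())  # cannot schedule new futures after shutdown
```

This is exactly the shape of my spider: the level-1 tasks are the ones that hit the shutdown flag when they try to submit level 2.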
If I instead create a new ThreadPoolExecutor for each level, the spider crawls correctly, but at the cost of an explosive number of threads, which soon crashes my computer.
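For the record, one way to keep a single ThreadPoolExecutor is to drive the recursion from the main thread: only the blocking directory listing goes to the pool, new listings are submitted as results come back, and shutdown happens only once nothing is pending. A sketch, where TREE and get_direct_subdirs are hypothetical stand-ins for my real listing code:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical directory tree standing in for the real get_direct_subdirs.
TREE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": [], "/a/1": []}

def get_direct_subdirs(url):
    return TREE.get(url, [])

def crawl(root_url, max_workers=20):
    leaves = []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # map: in-flight future -> the url it is listing
        pending = {ex.submit(get_direct_subdirs, root_url): root_url}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                url = pending.pop(fut)
                children = fut.result()  # re-raises any worker exception
                if not children:
                    leaves.append(url)   # leaf: queue for later processing
                else:
                    for child in children:
                        pending[ex.submit(get_direct_subdirs, child)] = child
    # the with-block calls shutdown() only after pending is empty
    return leaves
```

Because no worker ever submits or waits on another future, the pool can never deadlock or hit the shutdown flag, and the thread count stays capped at max_workers.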
Okay, so I thought the ThreadPoolExecutor would act as a semaphore and simply limit the number of concurrent threads. It does not. I rewrote the whole thing using a real semaphore from threading, and now it works again.
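Here is roughly the shape of the semaphore version (a sketch, again with a hypothetical TREE standing in for get_direct_subdirs). Note the caveat: the semaphore caps how many threads do work at once, but it does not cap how many thread objects get created.

```python
import threading

MAX_CONCURRENCY = 20
gate = threading.BoundedSemaphore(MAX_CONCURRENCY)

# Hypothetical directory tree standing in for the real get_direct_subdirs.
TREE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": [], "/a/1": []}

def get_direct_subdirs(url):
    return TREE.get(url, [])

leaves = []
leaves_lock = threading.Lock()
threads = []

def harvest(url):
    with gate:  # at most MAX_CONCURRENCY listings run at once
        children = get_direct_subdirs(url)
    # the gate is released before spawning children, so no deadlock
    if not children:
        with leaves_lock:
            leaves.append(url)  # leaf: queue for later processing
        return
    for child in children:
        t = threading.Thread(target=harvest, args=(child,))
        threads.append(t)       # append before start, so the join loop sees it
        t.start()

def crawl(root_url):
    harvest(root_url)
    i = 0
    while i < len(threads):     # the list may grow while we join
        threads[i].join()
        i += 1
    return leaves
```

A finished thread has already appended all of its children to the list, so joining the list in order is guaranteed to catch every thread, including ones created mid-loop.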