I have a script to traverse an AWS S3 bucket to do some aggregation at the file level.
from threading import Semaphore, Thread

class Spider:
    def __init__(self):
        self.sem = Semaphore(120)
        self.threads = list()

    def crawl(self, root_url):
        self.recursive_harvest_subroutine(root_url)
        for thread in self.threads:
            thread.join()

    def recursive_harvest_subroutine(self, url):
        children = get_direct_subdirs(url)
        self.sem.acquire()
        try:
            if len(children) == 0:
                queue_url_to_do_something_later(url)  # Done
            else:
                for child_url in children:
                    # One new thread per child directory
                    thread = Thread(target=self.recursive_harvest_subroutine, args=(child_url,))
                    self.threads.append(thread)
                    thread.start()
        finally:
            self.sem.release()
This used to run okay until I encountered a bucket with several TB of data and hundreds of thousands of sub-directories. The number of Thread objects in self.threads grows very quickly, and soon the server reported:
RuntimeError: can't start new thread
There is some extra processing I have to do in the script, so I can't just get all files from the bucket.
Currently I only let the script go parallel after a depth of at least 2, but that's just a workaround. Any suggestions are appreciated.
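For illustration, one way to read the depth-2 workaround is sketched below: walk the first two levels serially, then hand each subtree found at that depth to a fixed-size pool. This is only a sketch under assumptions. TREE, collect_at_depth, harvest_serial and the max_workers value are hypothetical names made up for the example, and get_direct_subdirs is replaced by an in-memory stand-in rather than a real S3 listing call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory tree standing in for the bucket layout.
TREE = {
    "root/": ["root/a/", "root/b/"],
    "root/a/": ["root/a/x/", "root/a/y/"],
    "root/b/": ["root/b/z/"],
    "root/a/x/": ["root/a/x/deep/"],
}

def get_direct_subdirs(url):
    """Stand-in for the real listing call (assumption, not the S3 API)."""
    return TREE.get(url, [])

def collect_at_depth(url, depth, leaves, starts):
    """Walk serially; record early leaves and the subtree roots found at `depth`."""
    children = get_direct_subdirs(url)
    if not children:
        leaves.append(url)
    elif depth == 0:
        starts.append(url)
    else:
        for child in children:
            collect_at_depth(child, depth - 1, leaves, starts)

def harvest_serial(url):
    """Walk one subtree in a single thread and return its leaf directories."""
    children = get_direct_subdirs(url)
    if not children:
        return [url]
    found = []
    for child in children:
        found.extend(harvest_serial(child))
    return found

def crawl(root_url, depth=2, max_workers=8):
    leaves, starts = [], []
    collect_at_depth(root_url, depth, leaves, starts)
    # The pool caps the number of live threads at max_workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for found in pool.map(harvest_serial, starts):
            leaves.extend(found)
    return leaves

result = crawl("root/")
```

The number of threads here depends only on max_workers, not on how many directories the bucket contains, but the parallelism is only as good as the spread of work across the subtrees found at the chosen depth.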
The original code effectively worked breadth-first: it started a thread for every sub-directory it discovered, which left a huge number of threads waiting on the semaphore. I changed it to depth-first and everything is working fine. Pseudo code in case someone needs this in the future:
from threading import Lock, Semaphore, Thread

class Spider:
    def __init__(self):
        self.sem = Semaphore(120)
        self.urls = list()
        self.mutex = Lock()

    def crawl(self, root_url):
        self.recursive_harvest_subroutine(root_url)
        # is_done() is my own completion check (not shown)
        while not is_done():
            self.sem.acquire()
            url = self.urls.pop(0)
            thread = Thread(target=self.recursive_harvest_subroutine, args=(url,))
            thread.start()
            self.sem.release()

    def recursive_harvest_subroutine(self, url):
        children = get_direct_subdirs(url)
        if len(children) == 0:
            queue_url_to_do_something_later(url)  # Done
        else:
            with self.mutex:
                # Insert at the front so the newest children are visited first (DFS)
                for child_url in children:
                    self.urls.insert(0, child_url)
There is no join(), so I implemented my own is_done() check.
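As an alternative sketch (a swapped-in technique, not the code above), the same depth-first idea can be expressed with a fixed number of worker threads pulling from a queue.LifoQueue. Queue.join() then plays the role of the is_done() check, because it returns only once every queued URL has been marked done. TREE, num_workers and the in-memory get_direct_subdirs below are hypothetical stand-ins for the real bucket listing.

```python
import queue
import threading

# Hypothetical in-memory tree standing in for the bucket layout.
TREE = {
    "root/": ["root/a/", "root/b/"],
    "root/a/": ["root/a/x/", "root/a/y/"],
    "root/b/": ["root/b/z/"],
}

def get_direct_subdirs(url):
    """Stand-in for the real listing call (assumption, not the S3 API)."""
    return TREE.get(url, [])

def crawl(root_url, num_workers=8):
    pending = queue.LifoQueue()      # LIFO order: newest children are taken first
    leaves = []
    leaves_lock = threading.Lock()

    def worker():
        while True:
            url = pending.get()
            if url is None:          # sentinel: shut this worker down
                pending.task_done()
                return
            try:
                children = get_direct_subdirs(url)
                if children:
                    # Children are queued before the parent is marked done,
                    # so the unfinished-task count cannot reach zero early.
                    for child in children:
                        pending.put(child)
                else:
                    with leaves_lock:
                        leaves.append(url)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.put(root_url)
    pending.join()                   # blocks until every queued URL is processed
    for _ in threads:
        pending.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return leaves

result = crawl("root/")
```

Only num_workers threads ever exist, however many sub-directories the bucket has, so the "can't start new thread" error should not depend on the size of the bucket.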