I'm processing all the files in a directory, using multiple threads to handle files in parallel. It works fine, except that the threads seem to stay alive, so the thread count of the process climbs until it reaches roughly 1,000 threads, at which point it raises thread.error: can't start new thread. I know this error is caused by an OS-level limit on the number of threads, but I can't figure out which bug is keeping the threads alive. Any ideas? Here is a minimal version of my code.
from threading import Thread
from queue import Queue  # Queue.Queue on Python 2
import os

class Worker(Thread):
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            try:
                func(*args, **kargs)
            except Exception as e:
                print(e)
            self.tasks.task_done()

class ThreadPool:
    def __init__(self, num_threads):
        self.tasks = Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args, **kargs):
        self.tasks.put((func, args, kargs))

    def wait_completion(self):
        self.tasks.join()

def foo(filename):
    pool = ThreadPool(32)
    iterable_data = process_file(filename)
    for data in iterable_data:
        pool.add_task(some_function, data)
    pool.wait_completion()

files = os.listdir(directory)
for file in files:
    foo(file)
You are launching a new ThreadPool with 32 threads for every file. Each Worker is a daemon thread blocked in an infinite loop on tasks.get(), and nothing ever tells it to exit, so every call to foo() permanently leaks 32 threads; with a large number of files you hit the OS limit quickly. And since only one thread at a time can execute Python bytecode in CPython (because of the Global Interpreter Lock), all those extra threads don't necessarily make it faster anyway.
Move the creation of the ThreadPool outside of the foo() function and reuse a single pool for all files.
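As a sketch of that fix, here is roughly the same structure using the standard library's concurrent.futures.ThreadPoolExecutor, which also takes care of shutting the worker threads down. The process_file, some_function, and filename values below are hypothetical stand-ins for the ones in your code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the question's process_file / some_function.
def process_file(filename):
    # Pretend each file yields three data items.
    return [(filename, i) for i in range(3)]

def some_function(data):
    return data

filenames = ["a.txt", "b.txt"]  # stands in for os.listdir(directory)

# One pool, created once and shared by every file: its 32 threads are
# reused across files and joined when the with-block exits, instead of
# 32 new daemon threads leaking on every call.
with ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(some_function, data)
               for filename in filenames
               for data in process_file(filename)]
    results = [f.result() for f in futures]

print(len(results))  # 6 tasks total, run by at most 32 threads
```

The same reshuffle works with your own ThreadPool class: construct it once before the loop over files and call wait_completion() once at the end.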