Comparing Python multiprocessing and multithreading on 100K files

Question

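I have a folder containing 100K files, 50GB in total. The goal is to read each file and run a few regular expressions on it to extract and store data. I am running tests to see which approach, multithreading or multiprocessing, works best.

The server I am using has 4 cores and 8GB of RAM. Without any multithreading, the task takes about 5 minutes.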

    import glob
    from concurrent.futures import ThreadPoolExecutor

    threads = []

    def read_files(filename):
        with open(filename, 'r') as f:
            text = f.read()

    with ThreadPoolExecutor(max_workers=50) as executor:
        for filename in glob.iglob('/root/my_app/my_app_venv/raw_files/*.txt', recursive=True):
            threads.append(executor.submit(read_files, filename))

With multithreading, the run averages 1 minute 30 seconds.

Now I am trying to set up the same test for multiprocessing, using the 4 cores on the server, but I have not gotten any results.

    from multiprocessing import Lock, Process, Queue, current_process
    import glob
    import time
    import queue

    def read_files(tasks_to_accomplish):
        # Each worker pulls filenames until the shared queue reports empty.
        while True:
            try:
                filename = tasks_to_accomplish.get_nowait()
                with open(filename, 'r') as f:
                    text = f.read()
            except queue.Empty:
                break

    def main():
        number_of_processes = 4
        tasks_to_accomplish = Queue()
        processes = []

        for filename in glob.iglob('/root/my_app/my_app_venv/raw_files/*.txt', recursive=True):
            tasks_to_accomplish.put(filename)

        # creating processes
        for w in range(number_of_processes):
            p = Process(target=read_files, args=(tasks_to_accomplish,))
            processes.append(p)
            p.start()

        # completing process
        for p in processes:
            p.join()

    if __name__ == '__main__':
        main()

Please help!

Answer 1

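Since you are already using concurrent.futures, I suggest ProcessPoolExecutor, which sits on top of multiprocessing in the same way that ThreadPoolExecutor sits on top of threading. The two classes have almost identical APIs.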

https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor
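
To make the suggestion concrete, here is a minimal sketch of the same job using ProcessPoolExecutor. It rests on some assumptions: the question does not show its regular expressions, so WORD_PATTERN is only a placeholder, and the names count_matches and process_files, the chunksize value, and the encoding settings are illustrative choices, not part of the original code.

```python
import glob
import re
from concurrent.futures import ProcessPoolExecutor

# Placeholder pattern (assumption): the question does not say which
# regular expressions are applied to each file.
WORD_PATTERN = re.compile(r"\w+")


def count_matches(filename):
    """Read one file and return (filename, number of pattern matches)."""
    with open(filename, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    return filename, len(WORD_PATTERN.findall(text))


def process_files(filenames, max_workers=4, chunksize=100):
    """Apply count_matches to every file using a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # chunksize batches many filenames into each task sent to a worker,
        # which reduces inter-process overhead when there are many small files.
        return dict(executor.map(count_matches, filenames, chunksize=chunksize))


if __name__ == "__main__":
    # Path taken from the question.
    files = glob.glob("/root/my_app/my_app_venv/raw_files/*.txt")
    results = process_files(files)
    print(f"processed {len(results)} files")
```

Returning a small result from each worker (here a count) rather than the full file text keeps the data passed between processes small. Whether this beats the thread pool depends on how much of the time goes to the regex work as opposed to disk reads, so it is worth timing both on the real data.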

Solution 1
1 2020-08-26 02:35:02