processing very large text files in parallel using multiprocessing and threading
I have found several other questions that touch on this topic, but none that are quite like my situation.
I have several very large text files (3+ gigabytes in size). I would like to process them (say 2 documents) in parallel using multiprocessing. As part of my processing (within a single process) I need to make an API call, and because of this would like each process to have its own threads running asynchronously.
I have come up with a simplified example (I have commented the code to try to explain what I think it should be doing):
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time


def process_huge_file(*, file_, batch_size=250, num_threads=4):
    # create an APICaller instance for each process that has its own Queue
    api_call = APICaller()
    batch = []
    # create threads that will run asynchronously to make API calls
    # I expect these to immediately block since there is nothing in the Queue
    # (which is what api_call.run depends on to make a call)
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    ####
    # start processing the file line by line
    for line in file_:
        # if we are at our batch size, add the batch to the api_call to let
        # the threads do their api calling
        if i % batch_size == 0:
            api_call.queue.put(batch)
        else:
            # add fake line to batch
            batch.append(fake_line)


class APICaller:
    def __init__(self):
        # thread-safe queue to feed the threads which point at instances
        # of these APICaller objects
        self.queue = Queue()

    def run(self):
        print("waiting for something to do")
        self.queue.get()
        print("processing item in queue")
        time.sleep(0.1)
        print("finished processing item in queue")


if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 1000
    fake_docs = [[fake_line] * 1000 for i in range(2)]
    ####
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
As the code is now, "waiting for something to do" prints 8 times (which makes sense: 4 threads per process) and then it stops or "deadlocks", which is not what I expect - I expect it to start sharing time with the threads as soon as I start putting items in the Queue, but the code does not appear to make it that far. I would ordinarily step through the code to find the hang-up, but I still don't have a solid understanding of how best to debug with threads (another topic for another day). In the meantime, can someone help me figure out why my code is not doing what it should be doing?
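To make the failure mode concrete: Thread.join() only returns once the thread's target function returns, and a thread blocked in Queue.get() never returns unless something is put on the queue first. A minimal, self-contained illustration of that interaction (separate from the code above, names purely illustrative):

from queue import Queue
from threading import Thread

q = Queue()
t = Thread(target=q.get)  # the thread blocks inside get() because the queue is empty
t.start()
t.join()                  # join() waits for get() to return, so this line never completes
print("never reached")    # nothing is ever put on q, so execution hangs on the join above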
I have made a few adjustments and additions, and the code now appears to do what it is supposed to. The main adjustments are: adding a CloseableQueue class (from Brett Slatkin's Effective Python, Item 55), and ensuring that I call close and join on the queue so that the threads exit properly. Full code with these changes below (a stripped-down sketch of just the shutdown pattern follows the full listing):
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time

from concurrency_utils import CloseableQueue


def sync_process_huge_file(*, file_, batch_size=250):
    batch = []
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            time.sleep(0.1)
            batch = []
            # api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)


def process_huge_file(*, file_, batch_size=250, num_threads=4):
    api_call = APICaller()
    batch = []
    # api call threads
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)
    # signal each worker thread to exit, then drain the queue and join the threads
    for _ in threads:
        api_call.queue.close()
    api_call.queue.join()
    for thread in threads:
        thread.join()


class APICaller:
    def __init__(self):
        self.queue = CloseableQueue()

    def run(self):
        for item in self.queue:
            print("waiting for something to do")
            print("processing item in queue")
            time.sleep(0.1)
            print("finished processing item in queue")
        print("exiting run")


if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 1000
    fake_docs = [[fake_line] * 10000 for i in range(2)]
    ####
    time_s = time.time()
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
    time_e = time.time()
    print(f"took {time_e - time_s} ")


# concurrency_utils.py (the CloseableQueue imported above)
class CloseableQueue(Queue):
    SENTINEL = object()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def close(self):
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return  # exit thread
                yield item
            finally:
                self.task_done()
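The key detail is the shutdown sequence: one close() call per worker thread puts one sentinel per worker on the queue, each worker's iteration over the queue ends when it pulls a sentinel, and queue.join() plus the thread joins can then complete. A stripped-down sketch of just that pattern, assuming the CloseableQueue shown above is importable from concurrency_utils:

from threading import Thread

from concurrency_utils import CloseableQueue  # the class shown above

def worker(q):
    for item in q:              # the loop ends when this thread pulls a sentinel
        print("handled", item)

q = CloseableQueue()
threads = [Thread(target=worker, args=(q,)) for _ in range(2)]
for t in threads:
    t.start()

for i in range(5):
    q.put(i)

for _ in threads:               # one close() (one sentinel) per worker thread
    q.close()
q.join()                        # blocks until every put item has been marked task_done
for t in threads:
    t.join()                    # the workers have returned, so these joins complete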
As expected, this is a great speedup compared to running synchronously - 120 seconds versus 50 seconds.