processing very large text files in parallel using multiprocessing and threading
I have found several other questions that touch on this topic, but none that are quite like my situation.
I have several very large text files (3+ gigabytes in size). I would like to process them (say 2 documents) in parallel using multiprocessing. As part of my processing (within a single process) I need to make an API call, and because of this would like each process to have its own threads running asynchronously.
I have come up with a simplified example (I have commented the code to try to explain what I think it should be doing):
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time


def process_huge_file(*, file_, batch_size=250, num_threads=4):
    # create an APICaller instance for each process that has its own Queue
    api_call = APICaller()
    batch = []
    # create threads that will run asynchronously to make API calls
    # I expect these to immediately block since there is nothing in the Queue
    # (which is what api_call.run depends on to make a call)
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    ####
    # start processing the file line by line
    for line in file_:
        # if we are at our batch size, add the batch to the api_call to let
        # the threads do their api calling
        if i % batch_size == 0:
            api_call.queue.put(batch)
        else:
            # add fake line to batch
            batch.append(fake_line)


class APICaller:
    def __init__(self):
        # thread-safe queue to feed the threads which point at instances
        # of these APICaller objects
        self.queue = Queue()

    def run(self):
        print("waiting for something to do")
        self.queue.get()
        print("processing item in queue")
        time.sleep(0.1)
        print("finished processing item in queue")


if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 1000
    fake_docs = [[fake_line] * 1000 for i in range(2)]
    ####
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
As the code is now, "waiting for something to do" prints 8 times (which makes sense: 4 threads per process) and then it stops or "deadlocks", which is not what I expect - I expect it to start sharing time with the threads as soon as I start putting items in the Queue, but the code does not appear to make it that far. I would ordinarily step through the code to find the hang-up, but I still don't have a solid understanding of how best to debug with threads (another topic for another day). In the meantime, can someone help me figure out why my code is not doing what it should be doing?
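To make the failure mode concrete: Thread.join() only returns once the thread's target function returns, and a thread blocked in Queue.get() never returns unless something is put on the queue first. A minimal, self-contained illustration of that interaction (separate from the code above, names purely illustrative):

from queue import Queue
from threading import Thread

q = Queue()
t = Thread(target=q.get)  # the thread blocks inside get() because the queue is empty
t.start()
t.join()                  # join() waits for get() to return, so this line never completes
print("never reached")    # nothing is ever put on q, so execution hangs on the join above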
I have made a few adjustments and additions, and the code now appears to do what it is supposed to. The main adjustments are: adding a CloseableQueue class (from Brett Slatkin's Effective Python, Item 55), and ensuring that I call close and join on the queue so that the threads exit properly. Full code with these changes below (a stripped-down sketch of just the shutdown pattern follows the full listing):
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time

from concurrency_utils import CloseableQueue


def sync_process_huge_file(*, file_, batch_size=250):
    batch = []
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            time.sleep(0.1)
            batch = []
            # api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)


def process_huge_file(*, file_, batch_size=250, num_threads=4):
    api_call = APICaller()
    batch = []
    # api call threads
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)
    # signal each worker thread to exit, then drain the queue and join the threads
    for _ in threads:
        api_call.queue.close()
    api_call.queue.join()
    for thread in threads:
        thread.join()


class APICaller:
    def __init__(self):
        self.queue = CloseableQueue()

    def run(self):
        for item in self.queue:
            print("waiting for something to do")
            print("processing item in queue")
            time.sleep(0.1)
            print("finished processing item in queue")
        print("exiting run")


if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 1000
    fake_docs = [[fake_line] * 10000 for i in range(2)]
    ####
    time_s = time.time()
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
    time_e = time.time()
    print(f"took {time_e - time_s} ")


# concurrency_utils.py (the CloseableQueue imported above)
class CloseableQueue(Queue):
    SENTINEL = object()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def close(self):
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return  # exit thread
                yield item
            finally:
                self.task_done()
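The key detail is the shutdown sequence: one close() call per worker thread puts one sentinel per worker on the queue, each worker's iteration over the queue ends when it pulls a sentinel, and queue.join() plus the thread joins can then complete. A stripped-down sketch of just that pattern, assuming the CloseableQueue shown above is importable from concurrency_utils:

from threading import Thread

from concurrency_utils import CloseableQueue  # the class shown above

def worker(q):
    for item in q:              # the loop ends when this thread pulls a sentinel
        print("handled", item)

q = CloseableQueue()
threads = [Thread(target=worker, args=(q,)) for _ in range(2)]
for t in threads:
    t.start()

for i in range(5):
    q.put(i)

for _ in threads:               # one close() (one sentinel) per worker thread
    q.close()
q.join()                        # blocks until every put item has been marked task_done
for t in threads:
    t.join()                    # the workers have returned, so these joins complete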
As expected, this is a great speedup compared to running synchronously - 120 seconds versus 50 seconds.