Using concurrent.futures without running out of RAM

Question

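I'm doing some file parsing that is a CPU-bound task. No matter how many files I throw at the process, it uses no more than about 50 MB of RAM. The task is parallelisable, and I've set it up to use concurrent.futures, as shown below, to parse each file in a separate process: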

    from concurrent import futures
    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        # A dictionary with the future as the key and the filename as the value
        jobs = {}

        # Loop through the files and run the parse function for each file, passing it the filename.
        # The results can come back in any order.
        for this_file in files_list:
            job = executor.submit(parse_function, this_file, **parser_variables)
            jobs[job] = this_file

        # Get the completed jobs whenever they are done
        for job in futures.as_completed(jobs):

            # Get the result (job.result()) and the file the job is based on (jobs[job])
            results_list = job.result()
            this_file = jobs[job]

            # Delete the job from the dict as we don't need to store it.
            del jobs[job]

            # post-processing (putting the results into a database)
            post_process(this_file, results_list)

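The problem is that when I run this using futures, RAM usage rockets, and before long I've run out and Python has crashed. This is probably in large part because the results from parse_function are several MB in size. Once the results have been through post_process, the application has no further need of them. As you can see, I'm trying del jobs[job] to clear items out of jobs, but this has made no difference: memory usage remains unchanged and seems to increase at the same rate.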

I've also confirmed that it isn't because it's waiting on the post_process function, by using only a single process and throwing in a time.sleep(1).

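The futures docs say nothing about memory management, and while a brief search indicates it has come up before in real-world applications of futures ("Clear memory in python loop" and http://grokbase.com/t/python/python-list/1458ss5etz/real-world-use-of-concurrent-futures), the answers don't translate to my use case (they're all concerned with timeouts and the like).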

So, how do you use concurrent.futures without running out of RAM? (Python 3.5)

Answer 1

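I'll take a shot (this might be a wrong guess...)

You might need to submit your work bit by bit, since on each submit you're making a copy of parser_variables, which may end up chewing up your RAM. Limiting the number of outstanding jobs also limits how many finished results sit in memory waiting for post_process.

Here is working code, with "<----" marking the interesting parts: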

    from concurrent import futures

    MAX_JOBS_IN_QUEUE = 12  # assumed value, not from the original answer; tune to your memory budget

    with futures.ProcessPoolExecutor(max_workers=6) as executor:
        # A dictionary with the future as the key and the filename as the value
        jobs = {}

        # Loop through the files and run the parse function for each file, passing it the filename.
        # The results can come back in any order.
        files_left = len(files_list)  # <----
        files_iter = iter(files_list)  # <------

        while files_left:
            for this_file in files_iter:
                job = executor.submit(parse_function, this_file, **parser_variables)
                jobs[job] = this_file
                if len(jobs) > MAX_JOBS_IN_QUEUE:
                    break  # limit the job submission for now

            # Get the completed jobs whenever they are done
            for job in futures.as_completed(jobs):

                files_left -= 1  # one down, many to go...   <---

                # Get the result (job.result()) and the file the job is based on (jobs[job])
                results_list = job.result()
                this_file = jobs[job]

                # Delete the job from the dict as we don't need to store it.
                del jobs[job]

                # Post-processing (putting the results into a database)
                post_process(this_file, results_list)
                break  # give a chance to add more jobs <-----
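
The same idea of capping the number of outstanding jobs can also be written with futures.wait() and FIRST_COMPLETED instead of breaking out of as_completed(). The following is only a sketch under assumptions: run_bounded and max_in_flight are hypothetical names, not part of concurrent.futures, and the demo uses a ThreadPoolExecutor with a lambda purely so that it runs anywhere (a ProcessPoolExecutor would need a picklable, module-level function).

```python
import itertools
from concurrent import futures


def run_bounded(executor, func, items, max_in_flight=12):
    """Yield (item, result) pairs, keeping at most max_in_flight futures alive.

    Hypothetical helper for illustration; not part of concurrent.futures.
    """
    items = iter(items)
    pending = {}  # future -> item it was submitted for

    # Prime the queue with the first batch of work.
    for item in itertools.islice(items, max_in_flight):
        pending[executor.submit(func, item)] = item

    while pending:
        # Block until at least one outstanding future has finished.
        done, _ = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
        for fut in done:
            item = pending.pop(fut)  # drop our reference to the finished future
            yield item, fut.result()
        # Top the queue back up: one new job for each finished one.
        for item in itertools.islice(items, len(done)):
            pending[executor.submit(func, item)] = item


if __name__ == "__main__":
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for n, squared in run_bounded(executor, lambda x: x * x, range(10), max_in_flight=3):
            print(n, squared)
```

Because new work is only submitted after finished work has been consumed, the number of results held in memory stays bounded by max_in_flight, whatever the length of the input.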

Answer 2

Try adding del to your code like this:

    for job in futures.as_completed(jobs):
        del jobs[job]  # or `val = jobs.pop(job)`
        # del job  # or `job._result = None`
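
Why dropping the reference matters can be checked with a small experiment: a Future keeps its result alive for as long as anything still refers to the Future. The sketch below assumes CPython's reference counting; Payload and result_is_freed_after_del are hypothetical names made up for the demonstration, and a small class is used because plain lists cannot be weakly referenced.

```python
import gc
import weakref
from concurrent.futures import ThreadPoolExecutor


class Payload:
    """Stand-in for a large parse result (hypothetical, for illustration only)."""

    def __init__(self):
        self.data = bytearray(1_000_000)


def result_is_freed_after_del():
    """Return (alive_before, alive_after) for a result around `del job`."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        job = executor.submit(Payload)
        # Keep only a weak reference, so this probe does not keep the result alive.
        probe = weakref.ref(job.result())
    alive_before = probe() is not None  # the future still holds the result
    del job                             # drop the last reference to the future
    gc.collect()                        # also clear any reference cycles
    alive_after = probe() is not None
    return alive_before, alive_after
```

On CPython this is expected to return (True, False): the result survives while job exists and is reclaimed once job is deleted. Other interpreters may free the memory later.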

Answer 3

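Same problem for me.

In my case I need to start millions of threads. In Python 2, I would write a thread pool myself using a dict. But in Python 3 I ran into the following error when deleting finished threads dynamically:

    RuntimeError: dictionary changed size during iteration

So I had to use concurrent.futures. At first I coded it like this: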

    from concurrent.futures import ThreadPoolExecutor
    ......
    if __name__ == '__main__':
        all_resouces = get_all_resouces()
        with ThreadPoolExecutor(max_workers=50) as pool:
            for r in all_resouces:
                pool.submit(handle_resource, *args)

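But memory was soon exhausted, because memory is released only after all threads have finished. I needed to delete finished threads before too many threads had started. So I read the docs here: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures

and found that Executor.shutdown(wait=True) might be what I needed. Here is my final solution: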

    from concurrent.futures import ThreadPoolExecutor
    ......
    if __name__ == '__main__':
        all_resouces = get_all_resouces()
        i = 0
        while i < len(all_resouces):
            with ThreadPoolExecutor(max_workers=50) as pool:
                for r in all_resouces[i:i+1000]:
                    pool.submit(handle_resource, *args)
                i += 1000

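As the docs put it: you can avoid having to call this method explicitly if you use the with statement, which will shut down the Executor (waiting as if Executor.shutdown() were called with wait set to True).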
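
The batching approach can be factored into a small helper that also works when the input is a generator rather than a list that supports slicing. This is a sketch only: chunked and process_in_batches are hypothetical names, and the batch size of 1000 and 50 workers simply mirror the numbers used above.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(handler, resources, batch_size=1000, max_workers=50):
    """Run handler over resources, one short-lived pool per batch.

    Returns the number of resources submitted. Hypothetical helper.
    """
    submitted = 0
    for batch in chunked(resources, batch_size):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for resource in batch:
                pool.submit(handler, resource)
        # Leaving the with block waits for the whole batch, after which
        # the pool and its futures can be garbage collected.
        submitted += len(batch)
    return submitted
```

One trade-off to keep in mind: each batch waits for its slowest task before the next batch starts, and because the futures are discarded, any exception raised inside handler goes unnoticed unless handler logs it itself.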

Answer 4

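Looking at the concurrent.futures.as_completed() function, I learned that it is enough to ensure there is no longer any reference to the future. If you drop this reference as soon as you've got the result, you'll minimise memory usage.

I use a set to store my Future instances, because everything I care about is already returned by the Future in its result (basically, the status of the dispatched work). Other implementations use a dict, as in your case, because you don't return the input filename as part of the worker's result.

Using a set means I can use the remove() method to destroy the reference to the Future. Internally, as_completed() already takes care of removing its own reference after it has yielded a completed Future to you.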

    futures = set(executor.submit(thread_worker, work) for work in workload)

    for future in concurrent.futures.as_completed(futures):
        futures.remove(future)  # the future's work payload memory usage can be freed now
        output = future.result()
        ...  # on next loop iteration, garbage will be collected for the result data, too
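
For the dict case mentioned above, where the input filename is not part of the result, the same pattern might look like the following. It is a sketch under assumptions: parse_all is a hypothetical name, and a ThreadPoolExecutor is used only to keep the example self-contained. Mutating jobs inside the loop is safe because as_completed() works on its own copy of the futures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_all(parse, names, max_workers=4):
    """Yield (name, result) pairs, dropping each future once it completes.

    Hypothetical helper for illustration only.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        jobs = {executor.submit(parse, name): name for name in names}
        for job in as_completed(jobs):
            name = jobs.pop(job)   # removes our reference and recovers the input
            result = job.result()
            del job                # the loop variable would otherwise keep it alive
            yield name, result


if __name__ == "__main__":
    for name, parsed in parse_all(str.upper, ["a.txt", "b.txt", "c.txt"]):
        print(name, parsed)
```

Note that every job is still submitted up front here, so this bounds the memory held by finished results that have been consumed, not the number of results waiting to be consumed.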

4 answers

Answer 1: score 14, accepted, 2016-01-13 15:57:09
Answer 2: score 6, 2019-06-07 01:45:05
Answer 3: score 1, 2021-08-19 05:10:25
Answer 4: score 0, 2021-12-14 23:13:40
