
Python concurrent.futures.ProcessPoolExecutor crashing with full RAM

Program description

Hi, I've got a computationally heavy function which I want to run in parallel. The function is a test that accepts as inputs:

  • a DataFrame to test on
  • parameters on which the calculations will be run

The return value is a short list of calculation results.

I want to run the same function in a for loop with different parameters and the same input DataFrame, essentially brute-forcing to find the optimal parameters for my problem.

The code I've written

I currently run the code concurrently using ProcessPoolExecutor from the concurrent.futures module.

import concurrent.futures
from itertools import repeat
import pandas as pd

from my_tests import func


parameters = [
    (arg1, arg2, arg3),
    (arg1, arg2, arg3),
    ...
]
large_df = pd.read_csv(csv_path)

with concurrent.futures.ProcessPoolExecutor() as executor:
    for future in executor.map(func, repeat(large_df.copy()), parameters):
        test_result = future.result()
        ...

The problem

The problem I face is that I need to run a large number of iterations, but my program crashes almost instantly.

In order for it not to crash, I need to limit it to at most 4 workers, which is 1/4 of my CPU resources.

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    ...

I figured out that my program crashes due to full RAM (16 GB). What I found weird is that when I ran it with more workers, it gradually consumed more and more RAM, which it never released, until it crashed.

Instead of passing a copy of the DataFrame, I tried passing the file path, but apart from slowing down my program, it didn't change anything.
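For reference, a rough sketch of what that attempt looked like (load_and_test is a stand-in name, not my actual code; each task re-reads the CSV from disk instead of receiving a pickled copy):

def load_and_test(csv_path, parameter):
    # Hypothetical wrapper: re-reading the CSV in every task avoids
    # pickling the DataFrame but adds repeated I/O, hence the slowdown.
    df = pd.read_csv(csv_path)
    return func(df, parameter)

with concurrent.futures.ProcessPoolExecutor() as executor:
    for result in executor.map(load_and_test, repeat(csv_path), parameters):
        ...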

Do you have any idea why this problem occurs and how to solve it?

See my comment on what map actually returns: Executor.map yields the function's return values directly, not Future objects, so the future.result() call in your loop is incorrect.
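For example, with that correction the loop in your code becomes:

with concurrent.futures.ProcessPoolExecutor() as executor:
    # map yields the results themselves, in input order; no .result() needed
    for test_result in executor.map(func, repeat(large_df.copy()), parameters):
        ...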

This answer is relevant depending on how large your parameters list is, i.e. how many total tasks are being placed on the multiprocessing pool's task queue:

You are currently creating and passing a copy of your dataframe (with large_df.copy()) every time you submit a new task (one task for each element of parameters). One thing you can do is initialize your pool processes once, with a single copy per pool process that is used by every task submitted to and executed by that process. The assumption is that the dataframe itself is not modified by my_tests.func. If it is modified and you need a copy of the original large_df for each new task, the function worker (see below) can make the copy. In that case you would need 2 * N copies (instead of just N copies) to exist simultaneously, where N is the number of processes in the pool. This will still save you memory as long as the length of parameters is greater than that number, since in your code a copy of the dataframe exists for every task, either on the task queue or in a pool process's address space.

If you are running on a platform such as Linux that uses the fork method to create new processes, then each child process will automatically inherit a copy as a global variable:

import concurrent.futures
import pandas as pd

from my_tests import func


parameters = [
    (arg1, arg2, arg3),
    (arg1, arg2, arg3),
    ...
]

large_df = pd.read_csv(csv_path) # will be inherited

def worker(parameter):
    # If func modifies the dataframe, pass a per-task copy instead:
    #     return func(large_df.copy(), parameter)
    return func(large_df, parameter)

with concurrent.futures.ProcessPoolExecutor() as executor:
    for result in executor.map(worker, parameters):
        ...

my_tests.func expects a dataframe as its first argument, but with the above change the dataframe is no longer passed explicitly; it is now accessed as a global variable. So without modifying func, we need an adapter function, worker, that passes func what it expects. Of course, if you are able to modify func, then you can do without the adapter.
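For illustration, a rough sketch of the no-adapter variant under fork (the layout here is my assumption): func is rewritten in my_tests to read a module-level large_df, which the main script assigns before the pool is created so the forked children inherit it:

# my_tests.py (sketch)
large_df = None  # assigned by the main script before the pool is created

def func(parameter):
    # works on the module-level dataframe instead of taking it as an argument
    ...

# main script (sketch)
import concurrent.futures
import pandas as pd
import my_tests

my_tests.large_df = pd.read_csv(csv_path)  # children inherit this under fork

with concurrent.futures.ProcessPoolExecutor() as executor:
    for result in executor.map(my_tests.func, parameters):
        ...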

If you are running on a platform such as Windows that uses the spawn method to create new processes, then:

import concurrent.futures
import pandas as pd

from my_tests import func

def init_pool_processes(df):
    global large_df
    large_df = df


def worker(parameter):
    # If func modifies the dataframe, pass a per-task copy instead:
    #     return func(large_df.copy(), parameter)
    return func(large_df, parameter)

if __name__ == '__main__':
    
    parameters = [
        (arg1, arg2, arg3),
        (arg1, arg2, arg3),
        ...
    ]
    
    large_df = pd.read_csv(csv_path) # passed once to each pool process via the initializer
    
    with concurrent.futures.ProcessPoolExecutor(initializer=init_pool_processes, initargs=(large_df,)) as executor:
        for result in executor.map(worker, parameters):
            ...

Combining the suggestions of Aaron and Booboo, the cause of my problem was indeed that a large DataFrame was copied every single time I called the function, which made my computer run out of memory. The quick solution I found was to delete the copy of the DataFrame at the end of func:

def func(large_df_copy, parameters):
    ...
    # Drop the reference to the per-task copy so it can be freed
    # before the next task's copy arrives.
    del large_df_copy

I will look into modifying Booboo's code, as the DataFrame is indeed modified inside the function in order to run the calculations.
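For reference, a rough sketch of that planned adaptation (keeping the one-copy-per-process setup from the answer and letting worker make the per-task copy, as Booboo suggested):

def worker(parameter):
    # Give func its own copy, since it modifies the dataframe; at most
    # one extra copy per pool process exists at any time.
    return func(large_df.copy(), parameter)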

Thank you very much for helping me out, I appreciate it a lot!
