Is it efficient to have a global variable when using multiprocessing?

Please consider this cool setup:

from multiprocessing import Pool, cpu_count
from datetime import datetime
import pandas as pd
import numpy as np

def helper(master_df):
    max_index = master_df['key'].max()
    min_index = master_df['key'].min()
    # note how slave is defined before the multiprocessing pool is started
    return slave.iloc[min_index:max_index, ]

master = pd.DataFrame({'key': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
slave = pd.DataFrame({'key': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'value': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})

if __name__ == '__main__':
    startTime = datetime.now()
    p = Pool(cpu_count() - 1)
    ret_list = p.map(helper, [master.iloc[1:5, ], master.iloc[5:10, ]])
    print(datetime.now() - startTime)
    print(ret_list)
Essentially, I have two dataframes in memory.

As you can see in the main multiprocessing code, p.map receives, as arguments, two chunks of the master dataframe.

Then, (I imagine) each process spawned by multiprocessing will access the slave dataframe and use it (without modifying it). Indeed, you can see in the helper function that each process slices the slave dataframe and does some computation with it.

My question is: is it efficient to have a dataframe defined in the global namespace that is accessed by each process? I am not sure what happens in terms of RAM utilization (is slave duplicated in memory for each process?). That would not be a good idea, because in reality both master and slave are really big.

I guess one alternative would be to send a tuple to p.map that contains the chunked master AND the corresponding sliced slave dataframe. I am not sure whether this is a good idea (or how to do it properly).
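For illustration, here is a rough sketch of what I have in mind with that alternative (the make_pairs helper and the way the slave slice is cut from each master chunk's key range are just assumptions on my part):

from multiprocessing import Pool, cpu_count
import pandas as pd

def helper_pair(args):
    # each worker gets its own (master chunk, slave slice) tuple,
    # so nothing is read from the global namespace
    master_chunk, slave_chunk = args
    return slave_chunk

def make_pairs(master, slave, bounds):
    # bounds is a list of (start, stop) row ranges for master;
    # the matching slave slice is cut from the same key range
    pairs = []
    for start, stop in bounds:
        m = master.iloc[start:stop]
        lo, hi = m['key'].min(), m['key'].max()
        pairs.append((m, slave.iloc[lo:hi]))
    return pairs

if __name__ == '__main__':
    master = pd.DataFrame({'key': list(range(1, 11))})
    slave = pd.DataFrame({'key': list(range(1, 11)),
                          'value': list('abcdefghij')})
    with Pool(cpu_count() - 1) as p:
        ret_list = p.map(helper_pair, make_pairs(master, slave, [(1, 5), (5, 10)]))
    print(ret_list)

Whether pickling those pairs to each worker ends up cheaper than letting every process touch a shared slave is exactly what I am unsure about.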

Any ideas? Thanks!

This surprisingly depends on the operating system, as multiprocessing is implemented differently on Windows and Linux.

  • On Linux, under the hood, processes are created via a fork variant, where the child process initially shares the same address space as the parent and then relies on COW (copy-on-write). Under Linux, I've often had child processes access a read-only global DataFrame, and everything was fine (including performance); see the sketch after this list.

  • On Windows, under the hood, a whole new process is apparently spun up, and you may pay the performance penalty of copying the DataFrame to it (unless the processing it does is large enough to render that cost negligible). I haven't used Python on Windows, though, so I have no first-hand experience with it.
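To make the Linux point concrete, here is a minimal sketch (mine, not taken from the question's code) that forces the 'fork' start method so the workers read a module-level DataFrame through copy-on-write rather than receiving a pickled copy; the frame size and the sum are just placeholders:

import multiprocessing as mp
import numpy as np
import pandas as pd

# module-level, read-only frame; under 'fork' the children see the parent's
# copy via copy-on-write instead of getting a pickled duplicate
slave = pd.DataFrame({'key': np.arange(1_000_000),
                      'value': np.random.rand(1_000_000)})

def helper(bounds):
    lo, hi = bounds
    # read-only access to the global; no writes, so the big numeric
    # buffers stay shared between parent and child
    return slave.iloc[lo:hi]['value'].sum()

if __name__ == '__main__':
    ctx = mp.get_context('fork')   # POSIX only; not available on Windows
    with ctx.Pool(2) as pool:
        print(pool.map(helper, [(0, 500_000), (500_000, 1_000_000)]))

The same script fails on Windows at get_context('fork'), which is essentially the difference described above.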

Edit

An example of using joblib with DataFrames:

import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame(dict(a=[1, 3], b=[2, 3]))

def foo(i, df):
    return df + i

Parallel(n_jobs=2)(delayed(foo)(i, df) for i in range(10))

You could also use df as a global:

def foo(i):
    return df + i

from joblib import Parallel, delayed
Parallel(n_jobs=2)(delayed(foo)(i) for i in range(10))
