Is it efficient to have a global variable when using multiprocessing?
Please consider this cool setup:
from datetime import datetime
from multiprocessing import Pool, cpu_count

import pandas as pd

def helper(master_df):
    max_index = master_df['key'].max()
    min_index = master_df['key'].min()
    # note how slave is defined before running the multiprocessing
    return slave.iloc[min_index:max_index]

master = pd.DataFrame({'key': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
slave = pd.DataFrame({'key': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'value': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})

if __name__ == '__main__':
    startTime = datetime.now()
    p = Pool(cpu_count() - 1)
    ret_list = p.map(helper, [master.iloc[1:5], master.iloc[5:10]])
    print(datetime.now() - startTime)
    print(ret_list)
Essentially, I have two dataframes in memory. As you can see in the main multiprocessing code, p.map receives, as arguments, two chunks of the master dataframe. Then, (I imagine) each process spawned by multiprocessing will access the slave dataframe and use it (without modifying it). Indeed, you can see in the helper function that each process will slice the slave dataframe and do some computation with it.
My question is: is it efficient to have a dataframe defined in the global namespace that is accessed by each process? I am not sure what happens in terms of RAM utilization (is slave duplicated in memory for each process?). That would not be a good idea, because in reality both master and slave are really big.
I guess one alternative would be to send a tuple to p.map that contains the chunked master AND the corresponding sliced slave dataframe. Not sure if this is a good idea (and how to do it properly)?
Any ideas? Thanks!
This surprisingly depends on the operating system, as multiprocessing is implemented differently on Windows and Linux.
On Linux, under the hood, processes are created via a fork variant where the child process initially shares the same address space as the parent, and then performs COW (copy-on-write). Under Linux, I've often had child processes access a read-only global DataFrame, and everything was fine (including performance).
On Windows, under the hood, apparently, a whole new process is spun up, and you might pay a performance penalty for copying the DataFrame to it (unless the processing it does is large enough to render the cost negligible); but I haven't ever used Python on Windows, so I have no experience with it.
Edit
An example of using joblib with DataFrames:
import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame(dict(a=[1, 3], b=[2, 3]))

def foo(i, df):
    return df + i

Parallel(n_jobs=2)(delayed(foo)(i, df) for i in range(10))
You could also use df as a global:
def foo(i):
    return df + i

Parallel(n_jobs=2)(delayed(foo)(i) for i in range(10))