
How are parent process global variables copied to sub-processes in python multiprocessing

Ubuntu 20.04

My understanding of global variable access by different sub-processes in Python is this:

  1. Global variables (let's say b) are available to each sub-process in a copy-on-write capacity.
  2. If a sub-process modifies that variable, a copy of b is created first and then that copy is modified. This change would not be visible to the parent process (I will ask a question on this part later; see the small sketch right after this list).
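
To make point 2 concrete, here is a minimal toy sketch (a throwaway example, separate from the large array used in the experiments below; it assumes the default fork start method on Linux, as in the rest of this post):

import multiprocessing as mp

c = [0]  # a small module-level global in the parent

def child():
    c[0] = 42                   # mutates only the child's copy-on-write view
    print("child sees:", c[0])  # 42

p = mp.Process(target=child)
p.start()
p.join()
print("parent sees:", c[0])     # still 0: the child's change never propagates back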

I did a few experiments trying to understand when the object gets copied, but I could not conclude much.

Experiments:

import numpy as np
import multiprocessing as mp
import psutil
b=np.arange(200000000).reshape(-1,100).astype(np.float64)
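
For scale: this array is roughly 1.5 GiB in size, which also roughly matches the jump seen in the full-copy cases below. A quick check:

print(b.shape)                 # (2000000, 100)
print(b.nbytes / (1024 ** 3))  # ~1.49 GiB: 200,000,000 float64 values * 8 bytes each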

Then I tried to see how the memory consumption changes using the function below:

def f2():
    print(psutil.virtual_memory().used/(1024*1024*1024))
    global b
    print(psutil.virtual_memory().used/(1024*1024*1024))
    b = b + 1 ### I changed this statement to study the different memory behaviors. I am posting the results for different statements in place of b = b + 1.
    print(psutil.virtual_memory().used/(1024*1024*1024))

p2 = mp.Process(target=f2)
p2.start()
p2.join()

Results format:

statement used in place of b = b + 1
print 1
print 2
print 3
Comments and questions

Results:

b = b+1
6.571144104003906
6.57244873046875
8.082862854003906 
Only a copy-on-write view was provided, so there was no extra memory consumption until b = b + 1 executed, at which point a copy of b was created, hence the memory usage spike.

b[:, 1] = b[:, 1] + 1
6.6118621826171875
6.613414764404297
8.108139038085938
Only a copy-on-write view was provided, so there was no extra memory consumption until b[:, 1] = b[:, 1] + 1 executed. It seems that even if only part of the memory is to be updated (here just one column), the entire object gets copied. Seems fair (so far).

b[0, :] = b[0, :] + 1
6.580562591552734
6.581851959228516
6.582511901855469
NO MEMORY CHANGE! When I tried to modify a column it copied the entire b, but when I modify a row it does not create a copy? Can you please explain what happened here?


b[0:100000, :] = b[0:100000, :] + 1
6.572498321533203
6.5740814208984375
6.656215667724609
Slight memory spike. Assuming a partial copy, since I modified just the first 1/20th of the rows. But by the same logic, modifying a column should also have produced only a partial copy, not the full copy we saw in case 2 above. No? Can you please explain what happened here as well?

b[0:500000, :] = b[0:500000, :] + 1
6.593017578125
6.594577789306641
6.970676422119141
The partial-copy assumption was right, I think: a moderate memory spike reflecting the change to 1/4 of the total rows.

b[0:1000000, :] = b[0:1000000, :] + 1
6.570674896240234
6.5723876953125
7.318485260009766
In line with the partial-copy hypothesis.


b[0:2000000, :] = b[0:2000000, :] + 1
6.594249725341797
6.596080780029297
8.087333679199219
A full copy, since now we are modifying the entire array. This is effectively the same as b = b + 1; we have just referred to it using a slice over all the rows.

b[0:2000000, 1] = b[0:2000000, 1] + 1
6.564876556396484
6.566963195800781
8.069766998291016
Again a full copy. It seems that for row slices a partial copy gets created, while for a column slice a full copy gets created, which is weird to me. Can you please help me understand what the exact copy semantics of a child process's global variables are?

As you can see, I am not finding a way to justify the results I am seeing in the experimental setup described above. Can you please help me understand how global variables of the parent process are copied upon full/partial modification by the child process?

I have also read that:

The child gets a copy-on-write view of the parent memory space. As long as you load the dataset before firing the processes and you don't pass a reference to that memory space in the multiprocessing call (that is, workers should use the global variable directly), then there is no copy.

Question 1: What does "As long as you load the dataset before firing the processes and you don't pass a reference to that memory space in the multiprocessing call (that is, workers should use the global variable directly), then there is no copy" mean?

As answered by Mr. Tim Roberts below, it means:

If you pass the dataset as a parameter, then Python has to make a copy to transfer it over. The parameter passing mechanism doesn't use copy-on-write, partly because the reference counting stuff would be confused. When you create it as a global before things start, there's a solid reference, so the multiprocessing code can make copy-on-write happen.

However, I am not able to verify this behavior. Here are the few tests I ran to verify it:

import numpy as np
import multiprocessing as mp
import psutil
b=np.arange(200000000).reshape(-1,100).astype(np.float64)

Then I tried to see how the memory consumption changes using the function below:

def f2(b): ### Note that the array is passed as an argument rather than picked up as a global variable of the parent process
    print(psutil.virtual_memory().used/(1024*1024*1024))
    b = b + 1 ### I changed this statement to study the different memory behaviors. I am posting the results for different statements in place of b = b + 1.
    print(psutil.virtual_memory().used/(1024*1024*1024))

print(psutil.virtual_memory().used/(1024*1024*1024))
p2 = mp.Process(target=f2, args=(b,)) ### Note that the array is passed as an argument rather than picked up as a global variable of the parent process
p2.start()
p2.join()

Results format: same as above

Results:

b = b+1
6.692680358886719
6.69635009765625
8.189273834228516
The second print comes from within the function; hence, by then the copy should already have been made and we should see the second print to be around 8.18.

b = b
6.699306488037109
6.701808929443359
6.702671051025391
The second and third prints should have been around 8.18. The results suggest that no copy is created even though the array b is passed to the function as an argument.

Copy-on-write does one virtual memory page at a time. As long as your changes are within a single 4096-byte page, you'll only pay for that one page. When you modify a column, your changes are spread across many, many pages. We Python programmers aren't used to worrying about the layout in physical memory, but that's the issue here.
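
For the array in this question, a rough back-of-the-envelope sketch (assuming 4096-byte pages and numpy's default C-contiguous, row-major layout for b) lines up with the jumps observed above:

import numpy as np

PAGE = 4096                                        # assumed page size, typical on x86-64 Linux
rows, cols = 2_000_000, 100                        # shape of b after reshape(-1, 100)
row_bytes = cols * np.dtype(np.float64).itemsize   # 800 bytes per row

def dirtied_gib(span_bytes):
    # GiB of pages dirtied when every 4096-byte page inside a span of b
    # receives at least one write (rounded up to whole pages)
    pages = -(-span_bytes // PAGE)
    return pages * PAGE / 1024 ** 3

print(dirtied_gib(row_bytes))             # b[0, :]        -> ~1 page, invisible in the prints above
print(dirtied_gib(100_000 * row_bytes))   # b[0:100000, :] -> ~0.07 GiB
print(dirtied_gib(500_000 * row_bytes))   # b[0:500000, :] -> ~0.37 GiB
print(dirtied_gib(rows * row_bytes))      # b[:, 1]        -> ~1.49 GiB, i.e. effectively a full copy

With 800-byte rows, a single 4096-byte page holds parts of about five rows, so writing one element in every row (a column slice) dirties every page of the array, while writing one full row touches at most two pages.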

Question 1: If you pass the dataset as a parameter, then Python has to make a copy to transfer it over. The parameter passing mechanism doesn't use copy-on-write, partly because the reference counting stuff would be confused. When you create it as a global before things start, there's a solid reference, so the multiprocessing code can make copy-on-write happen.
