Multiprocessing: passing an array of dicts through shared memory
The following code works, but it is very slow due to passing the large data sets. In the actual implementation, the time it takes to create a process and send the data is almost the same as the calculation time, so by the time the second process is created, the first process is almost finished with its calculation, making parallelization pointless.
The code is the same as in this question, Multiprocessing has cutoff at 992 integers being joined as result, with the suggested change implemented below. However, I ran into the same problem as others, which I assume is pickling large data taking a long time.
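One quick way to confirm that pickling is the bottleneck is to time how long it takes to serialize a payload of the same shape (~4000 dicts of 200 key/value pairs each). This is just an illustrative measurement sketch, not code from the question:

```python
import pickle
import time

# Build a payload shaped like the data in the question:
# ~4000 list entries, each a dict of 200 key/value pairs.
payload = [{str(k): k for k in range(200)} for _ in range(4000)]

start = time.time()
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.time() - start

print('pickled %d bytes in %.3f seconds' % (len(blob), elapsed))
```

If this time is comparable to one worker's calculation time, copying the data dominates and shared memory (or a Manager) is worth pursuing.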
I have seen answers using multiprocessing.Array to pass a shared-memory array. I have an array of ~4000 indexes, but each index holds a dictionary with 200 key/value pairs. The data is only read by each process, some calculation is done, and then a matrix (4000x3, with no dicts) is returned.
Answers like Is shared readonly data copied to different processes for Python multiprocessing? use map. Is it possible to keep the system below and implement shared memory? Is there an efficient way to send the data to each process with an array of dicts, such as wrapping the dicts in some manager and then putting that inside a multiprocessing.Array?
import multiprocessing

def main():
    data = {}
    total = []
    for j in range(0,3000):
        total.append(data)
        for i in range(0,200):
            data[str(i)] = i
    CalcManager(total,start=0,end=3000)

def CalcManager(myData,start,end):
    print 'in calc manager'

    #Multi processing
    #Set the number of processes to use.
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Setup an empty array to store our processes
    procs = []

    #Divide up the data for the set number of processes
    interval = (end-start)/nprocs
    new_start = start

    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        print 'starting processes'
        new_end = new_start + interval
        #Make sure we don't go past the size of the data
        if new_end > end:
            new_end = end
        #Generate a new process and pass it the arguments
        data = myData[new_start:new_end]
        #Create the processes and pass the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(data,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print result

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

#MultiProcess Handling
def multiProcess(data,start,end,result_q,proc_num):
    print 'started process'
    results = []
    temp = []
    for i in range(0,22):
        results.append(temp)
        for j in range(0,3):
            temp.append(j)
    result_q.put(results)
    return

if __name__== '__main__':
    main()
Solved

By just putting the list of dictionaries into a manager, the problem was solved:

manager = Manager()
d = manager.list(myData)

It seems that the manager holding the list also manages the dicts contained by that list. The startup time is a bit slow, so it seems the data is still being copied, but it is done once at the beginning, and then inside the process the data is sliced.
import multiprocessing
import multiprocessing.sharedctypes as mt
from multiprocessing import Process, Lock, Manager
from ctypes import Structure, c_double

def main():
    data = {}
    total = []
    for j in range(0,3000):
        total.append(data)
        for i in range(0,100):
            data[str(i)] = i
    CalcManager(total,start=0,end=500)

def CalcManager(myData,start,end):
    print 'in calc manager'
    print type(myData[0])
    manager = Manager()
    d = manager.list(myData)

    #Multi processing
    #Set the number of processes to use.
    nprocs = 3
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    #Setup an empty array to store our processes
    procs = []

    #Divide up the data for the set number of processes
    interval = (end-start)/nprocs
    new_start = start

    #Create all the processes while dividing the work appropriately
    for i in range(nprocs):
        new_end = new_start + interval
        #Make sure we don't go past the size of the data
        if new_end > end:
            new_end = end
        #Generate a new process and pass it the arguments
        data = myData[new_start:new_end]
        #Create the processes and pass the data and the result queue
        p = multiprocessing.Process(target=multiProcess,args=(d,new_start,new_end,result_q,i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    #Print out the results
    for i in range(nprocs):
        result = result_q.get()
        print len(result)

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()

#MultiProcess Handling
def multiProcess(data,start,end,result_q,proc_num):
    #print 'started process'
    results = []
    temp = []
    data = data[start:end]
    for i in range(0,22):
        results.append(temp)
        for j in range(0,3):
            temp.append(j)
    print len(data)
    result_q.put(results)
    return

if __name__ == '__main__':
    main()
Looking at your question, I assume the following:

- For each item in myData, you want to return an output (a matrix of some sort)
- You created a JoinableQueue (tasks) probably for holding the input, but are not sure how to use it

The code:

import logging
import multiprocessing

def create_logger(logger_name):
    ''' Create a logger that logs to the console '''
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)

    # create console handler and set appropriate level
    ch = logging.StreamHandler()
    formatter = logging.Formatter("%(processName)s %(funcName)s() %(levelname)s: %(message)s")
    ch.setFormatter(formatter)
    logger.addHandler(ch)
    return logger

def main():
    global logger
    logger = create_logger(__name__)
    logger.info('Main started')
    data = []
    for i in range(0,100):
        data.append({str(i):i})
    CalcManager(data,start=0,end=50)
    logger.info('Main ended')

def CalcManager(myData,start,end):
    logger.info('CalcManager started')
    #Initialize the multiprocessing queue so we can get the values returned to us
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()

    # Add tasks
    for i in range(start, end):
        tasks.put(myData[i])

    # Create processes to do work
    nprocs = 3
    for i in range(nprocs):
        logger.info('starting processes')
        p = multiprocessing.Process(target=worker,args=(tasks,results))
        p.daemon = True
        p.start()

    # Wait for tasks completion, i.e. tasks queue is empty
    try:
        tasks.join()
    except KeyboardInterrupt:
        logger.info('Cancel tasks')

    # Print out the results
    print 'RESULTS'
    while not results.empty():
        result = results.get()
        print result

    logger.info('CalcManager ended')

def worker(tasks, results):
    while True:
        try:
            task = tasks.get()   # one row of input
            task['done'] = True  # simulate work being done
            results.put(task)    # Save the result to the output queue
        finally:
            # JoinableQueue: for every get(), we need a task_done()
            tasks.task_done()

if __name__== '__main__':
    main()
I use the logging module as it offers a few advantages: among other things, the format string above tags every message with the process name and function name, which makes it much easier to follow what each worker is doing.

CalcManager is essentially a task manager which does the following:

- Creates the tasks and results queues
- Puts the input data into the tasks queue
- Creates the worker processes
- Waits for the tasks to complete, then prints out the results

worker is where the work is done:

- Each worker runs forever (the while True loop), pulling one item at a time from the tasks queue
- After each job, it calls task_done() so that the main process knows when all jobs are done. I put task_done in the finally clause to ensure it will run even if an error occurred during processing.

You may see some improvement by using a multiprocessing.Manager to store your list in a manager server, and having each child process access items from the dict by pulling them from that one shared list, rather than copying slices to each child process:
def CalcManager(myData,start,end):
    print 'in calc manager'
    print type(myData[0])
    manager = Manager()
    d = manager.list(myData)

    nprocs = 3
    result_q = multiprocessing.Queue()
    procs = []

    interval = (end-start)/nprocs
    new_start = start

    for i in range(nprocs):
        new_end = new_start + interval
        if new_end > end:
            new_end = end
        p = multiprocessing.Process(target=multiProcess,
                                    args=(d, new_start, new_end, result_q, i))
        procs.append(p)
        p.start()
        #Increment our next start to the current end
        new_start = new_end+1

    print 'finished starting'

    for i in range(nprocs):
        result = result_q.get()
        print len(result)

    #Join the processes to wait for all data/processes to be finished
    for p in procs:
        p.join()
This copies your entire data list to a Manager process prior to creating any of your workers. The Manager returns a Proxy object that allows shared access to the list. You then just pass the Proxy to the workers, which means their startup time will be greatly reduced, since there's no longer any need to copy slices of the data list. The downside here is that accessing the list will be slower in the children, since the access needs to go to the manager process via IPC. Whether or not this will really help performance is very dependent on exactly what work you're doing on the list in your work processes, but it's worth a try, since it requires very few code changes.