python multiprocessing: write to same excel file
I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read several posts here, but still can't get it done, as I'm quite confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task.

This is my code so far:
import multiprocessing
import pandas as pd

list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

def init(l):
    global lock
    lock = l

def f(k):
    global results
    # *** DO SOME STUFF HERE***
    results = results[ *** finished pandas dataframe *** ]
    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in the Excel file (I assume it is from the process that finishes last). Some questions about this code:

I would really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?

In __main__: time.sleep(0.1) gives the started process a timeslice to start up.

In f2(fq, q): time.sleep(0.1) gives the queue a timeslice to flush all buffered data to the pipe, which is needed because q.get_nowait() is used.

In w(q): it was only for testing, to simulate a long run of writer.to_excel(...); I removed this one.
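As an aside, the fixed sleep before get_nowait() can be avoided by using a blocking q.get(timeout=...), which waits just long enough for the queue's feeder thread to flush. A minimal sketch of that alternative (the producer/drain names are illustrative, not from the answer above):

```python
import multiprocessing as mp
# mp.Queue raises the Empty exception from the stdlib queue module
import queue

def producer(q):
    # Stands in for a worker pushing results into the queue.
    for i in range(3):
        q.put(i)

def drain(q, timeout=0.5):
    # Block briefly on each get() instead of pairing time.sleep()
    # with get_nowait(); the timeout covers the feeder-thread flush.
    items = []
    while True:
        try:
            items.append(q.get(timeout=timeout))
        except queue.Empty:
            return items

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q,))
    p.start()
    p.join()
    print(drain(q))  # [0, 1, 2]
```

The trade-off is a fixed worst-case wait of `timeout` at the end of the stream instead of a fixed sleep at the start.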
2) What is the difference between pool.map and pool = [mp.Process( . )]?

Using pool.map needs no Queue and no parameter passing, and the code is shorter. The worker_process has to return the result immediately and terminate. pool.map keeps starting new processes until all iterations are done. The results have to be processed after that.

Using pool = [mp.Process( . )] starts n processes. A process terminates on queue.Empty.
Can you think of a situation where you would prefer one method over the other?

Method 1: Quick setup, serialized, only interested in the result to continue.

Method 2: If you want to do all of the workload in parallel.
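Method 2 can be sketched generically: workers drain a shared task queue and terminate on queue.Empty (squaring numbers stands in for the real workload; all names here are illustrative):

```python
import multiprocessing as mp
import time
# mp.Queue raises the Empty exception defined in the stdlib queue module
import queue

def worker(task_q, result_q):
    # Each process drains the task queue and terminates when it
    # sees queue.Empty.
    while True:
        try:
            n = task_q.get_nowait()
        except queue.Empty:
            return
        result_q.put(n * n)

def demo(n_workers=2):
    task_q, result_q = mp.Queue(), mp.Queue()
    for n in range(5):
        task_q.put(n)
    time.sleep(0.1)  # let the feeder thread flush before get_nowait()
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(result_q.get(timeout=1) for _ in range(5))

if __name__ == '__main__':
    print(demo())  # [0, 1, 4, 9, 16]
```

Note that get_nowait() can also raise Empty under lock contention between workers, so a worker may exit while items remain; the remaining workers still drain the queue, which is why the pattern is safe here.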
You can't use a global writer in the processes. The writer instance has to belong to one process.
Usage of mp.Pool, for instance:
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))

    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    pool.close()
This leads to .to_excel(...) being called in sequence in the __main__ process.

If you want .to_excel(...) in parallel, you have to use mp.Queue().
For instance:

The worker process:
# mp.Queue exceptions have to be imported from the stdlib queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)
        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)
The writer process:
def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # Unpacking the 'STOP' string sent by __main__ raises
            # ValueError, which is used here as the signal to finish.
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)
        result.to_excel(writer, sheet_name=titel)
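Relying on tuple-unpacking 'STOP' to raise ValueError works, but a more explicit variant (my own sketch, not from the answer above) uses None as the sentinel, which needs no exception at all. Generic payloads stand in for the DataFrames:

```python
import multiprocessing as mp

def writer_loop(q):
    # Drain (title, payload) pairs until an explicit None sentinel,
    # then return what was collected (stands in for writer.save()).
    collected = {}
    while True:
        item = q.get()
        if item is None:
            return collected
        title, payload = item
        collected[title] = payload

if __name__ == '__main__':
    q = mp.Queue()
    q.put(('2017.03.20', 'result_a'))
    q.put(('2017.03.21', 'result_b'))
    q.put(None)
    print(writer_loop(q))
```

The __main__ process would then send q.put(None) instead of q.put('STOP') after joining the workers.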
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)

    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)

    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
        time.sleep(0.1)

    for p in pool:
        p.join()

    w_q.put('STOP')
    w_p.join()
Tested with Python 3.4.2 - pandas 0.19.2 - xlsxwriter 0.9.6