
python multiprocessing: write to same excel file

I am new to Python and I am trying to save the results of five different processes to one excel file (each process writes to a different sheet). I have read different posts here, but still can't get it done, as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task. This is my code so far:

import multiprocessing
import pandas as pd

list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

def init(l):
    global lock
    lock = l

def f(k):
    global results

    *** DO SOME STUFF HERE***

    results = results[ *** finished pandas dataframe *** ]

    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()

if __name__ == '__main__':
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

The result is that only one sheet gets created in the excel file (I assume it is from the process finishing last). Some questions about this code:

  • How can I avoid defining global variables?
  • Is it even possible to pass around dataframes?
  • Should I move the locking to main instead?

Really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks

1) Why did you implement time.sleep in several places in your 2nd method?

In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, since q.get_nowait() is used.
In w(q), it was only for testing, to simulate a long run of writer.to_excel(...); I removed this one.

2) What is the difference between pool.map and pool = [mp.Process( . )]?

Using pool.map needs no Queue and no parameters to be passed, so the code is shorter. The worker process has to return the result immediately and terminate. pool.map keeps starting new processes until all iterations are done. The results have to be processed after that.

Using pool = [mp.Process( . )] starts n processes. A process terminates on queue.Empty.

Can you think of a situation where you would prefer one method over the other?

Method 1: Quick setup, serialized, only interested in the result to continue.
Method 2: If you want to do the whole workload in parallel.
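To make the contrast concrete, here is a minimal, self-contained sketch of both methods (not from the original answer; square() stands in for the real per-day work, and a None sentinel per worker replaces the get_nowait()/queue.Empty check, which avoids the startup race that the time.sleep(0.1) calls work around):

```python
import multiprocessing as mp

def square(x):
    # stand-in for the real workload; returns its result immediately
    return x * x

def run_pool_map(n):
    # Method 1: pool.map -- no Queue, no extra parameters, results come
    # back in input order once all iterations are done
    with mp.Pool() as pool:
        return pool.map(square, range(n))

def worker(in_q, out_q):
    # Method 2: each mp.Process drains the input queue until it sees a sentinel
    while True:
        x = in_q.get()
        if x is None:        # sentinel: no more work, terminate
            return
        out_q.put(x * x)

def run_process_list(n, n_workers=2):
    in_q, out_q = mp.Queue(), mp.Queue()
    for x in range(n):
        in_q.put(x)
    for _ in range(n_workers):
        in_q.put(None)       # one sentinel per worker
    procs = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # results arrive in completion order, so sort for a stable comparison
    return sorted(out_q.get() for _ in range(n))

if __name__ == '__main__':
    print(run_pool_map(4))
    print(run_process_list(4))
```

With Method 1, the parent blocks in pool.map and gets an ordered result list; with Method 2, all workers run in parallel and results must be collected (and possibly reordered) from the queue.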


You can't use a global writer in the processes.
The writer instance has to belong to one process.

Usage of mp.Pool, for instance:

import multiprocessing as mp
import pandas as pd

def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))

    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])

    writer.save()
    pool.close()

This means .to_excel(...) is called sequentially in the __main__ process.


If you want parallel .to_excel(...), you have to use mp.Queue().
For instance:

The worker process:

# mp.Queue exceptions have to be imported from the queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            # no more work in the feed queue: terminate this worker
            exit(0)

        # *** DO SOME STUFF HERE***

        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)

The writer process:

def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # unpacking the 'STOP' sentinel string raises ValueError,
            # which ends the loop after saving the file
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)

        result.to_excel(writer, sheet_name=titel)

The __main__ process:

if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)

    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)

    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
        time.sleep(0.1)

    for p in pool:
        p.join()

    w_q.put('STOP')
    w_p.join()

Tested with Python 3.4.2, pandas 0.19.2, xlsxwriter 0.9.6.
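As a footnote, the same "collect in the parent, write sequentially" pattern of the first method also maps onto concurrent.futures (available since Python 3.2). This is a minimal sketch, not from the original answer; build_result is a hypothetical stand-in for the real per-day DataFrame computation, and the parent would write each result to its own sheet:

```python
from concurrent.futures import ProcessPoolExecutor

list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22",
                "2017.03.23", "2017.03.24"]

def build_result(day):
    # hypothetical stand-in for the real per-day computation
    return "result for %s" % day

def collect_results(days):
    # like pool.map: results come back in input order;
    # the parent can then write each one sequentially
    with ProcessPoolExecutor() as ex:
        return list(ex.map(build_result, days))

if __name__ == '__main__':
    for day, result in zip(list_of_days, collect_results(list_of_days)):
        print(day, result)
```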
