
Python: Multiprocessing code is very slow

I am pulling 0.8 million records in one go (this is a one-time process) from MongoDB using pymongo and performing some operations on them.

My code looks like this:

    from multiprocessing import Process

    procs = []
    cnt = 0
    for rec in cursor:  # cursor has 0.8 million rows
        print(cnt)
        cnt = cnt + 1
        url = rec['urlk']
        mkptid = rec['mkptid']
        cii = rec['cii']

        #self.process_single_layer(url, mkptid, cii)

        # spawn one process per record
        proc = Process(target=self.process_single_layer, args=(url, mkptid, cii))
        procs.append(proc)
        proc.start()

    # complete the processes
    for proc in procs:
        proc.join()

process_single_layer is a function which basically downloads URLs from the cloud and stores them locally.

Now the problem is that the downloading process is slow, as it has to hit a URL. And since the number of records is huge, processing even 1k rows takes 6 minutes.

To reduce the time I wanted to implement multiprocessing, but it is hard to see any difference with the above code.

Please suggest how I can improve the performance in this scenario.

First of all you need to count all the rows in your file and then spawn a fixed number of processes (ideally matching the number of your processor cores), to which you feed via queues (one for each process) a number of rows equal to total_number_of_rows / number_of_cores. The idea behind this approach is that you split the processing of those rows between multiple processes, hence achieving parallelism.

A way to find out the number of cores dynamically is by doing:

import multiprocessing as mp
cores_count = mp.cpu_count()
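To make the count-then-split idea above concrete, here is a minimal sketch under stated assumptions: the rows have already been counted (a small list stands in for the 0.8 million records), each worker gets its own queue, and the names worker, per_process and share are illustrative, not part of the original code.

import multiprocessing as mp

def worker(q, rows_to_read):
    # read exactly the number of rows this worker was assigned
    for _ in range(rows_to_read):
        row = q.get()
        # ..process a single row here (e.g. download and store it)

if __name__ == '__main__':
    rows = list(range(100))                      # stand-in for the counted records
    cores_count = mp.cpu_count()
    per_process = -(-len(rows) // cores_count)   # ceiling division

    processes = []
    for i in range(cores_count):
        share = rows[i * per_process:(i + 1) * per_process]
        q = mp.Queue()
        proc = mp.Process(target=worker, args=(q, len(share)))
        processes.append(proc)
        proc.start()
        for row in share:
            q.put(row)

    for p in processes:
        p.join()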

A slight improvement, which avoids the initial row count, is to distribute the rows cyclically: create the list of queues and then apply a cycle iterator over it.

A full example:

import queue
import multiprocessing as mp
import itertools as itools

cores_count = mp.cpu_count()


def dosomething(q):
    while True:
        try:
            row = q.get(timeout=5)
        except queue.Empty:
            break

        # ..do some processing here with the row
        pass


if __name__ == '__main__':
    processes = []
    queues = []

    # spawn the processes
    for i in range(cores_count):
        q = mp.Queue()
        queues.append(q)
        proc = mp.Process(target=dosomething, args=(q,))
        processes.append(proc)
        proc.start()

    # feed the rows to the workers, one queue at a time
    queues_cycle = itools.cycle(queues)
    for row in cursor:  # `cursor` is the pymongo cursor from the question
        q = next(queues_cycle)
        q.put(row)

    # do the join after spawning all the processes
    for p in processes:
        p.join()

It's easier to use a pool in this scenario.

Queues are not necessary as you don't need to communicate between your spawned processes. We can use Pool.map to distribute the workload.

Pool.imap or Pool.imap_unordered might be faster with a larger chunk size (ref: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap). You can use Pool.starmap if you want to get rid of the tuple unpacking. A sketch of these variants follows the example below.

from multiprocessing import Pool

def process_single_layer(data):
    # unpack the tuple and do the processing
    url, mkptid, cii = data
    return "downloaded" + url

def get_urls():
    # replace this code: iterate over the cursor and yield the necessary data as a tuple
    for rec in range(8):
        url = "url:" + str(rec)
        mkptid = "mkptid:" + str(rec)
        cii = "cii:" + str(rec)
        yield (url, mkptid, cii)

# you can come up with a suitable process count based on the number of CPUs
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(process_single_layer, get_urls()))
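
As a follow-up to the imap_unordered / starmap remark above, here is a hedged sketch of both variants; the chunksize value and the helper process_single_layer_unpacked are illustrative choices, not part of the original answer.

from multiprocessing import Pool

def process_single_layer(data):
    url, mkptid, cii = data
    return "downloaded" + url

def process_single_layer_unpacked(url, mkptid, cii):
    # same work, but starmap has already unpacked the tuple for us
    return "downloaded" + url

def get_urls():
    for rec in range(8):
        yield ("url:" + str(rec), "mkptid:" + str(rec), "cii:" + str(rec))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # imap_unordered yields results as they complete; a larger chunksize
        # reduces the number of inter-process hand-offs
        for result in pool.imap_unordered(process_single_layer, get_urls(), chunksize=2):
            print(result)

        # starmap unpacks each yielded tuple into separate arguments
        print(pool.starmap(process_single_layer_unpacked, get_urls()))

For the actual 0.8 million records, a larger chunksize (for example, a few hundred) cuts down the per-task overhead; the best value is worth measuring rather than guessing.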
