Best way to speed up multiprocessing code in Python?

I am trying to mess around with matrices in Python, and wanted to use multiprocessing to process each row separately for a math operation. I have posted a minimal reproducible sample below, but keep in mind that for my actual code I do in fact need the entire matrix passed to the helper function. This sample takes practically forever to process a 10,000 by 10,000 matrix: almost 2 hours with 9 processes. Looking in Task Manager, it seems only 4-5 of the threads will run at any given time on my CPU, and the application never uses more than 25%. I've done my absolute best to avoid branches in my real code, though the sample provided is branchless. It still takes roughly 25 seconds to process a 1,000 by 1,000 matrix on my machine, which is ludicrous to me as a mainly C++ developer; I wrote serial code in C that processes the entire 10,000 by 10,000 matrix in less than a second. I think the main bottleneck is the multiprocessing code, but I am required to do this with multiprocessing. Any ideas for how I could go about improving this? Each row can be processed entirely separately, but the results need to be joined back together into a matrix for my actual code.

import random
from multiprocessing import Pool
import time


def addMatrixRow(matrixData):
    matrix = matrixData[0]
    rowNum = matrixData[1]
    del (matrixData)
    rowSum = 0
    for colNum in range(len(matrix[rowNum])):
        rowSum += matrix[rowNum][colNum]

    return rowSum


def genMatrix(row, col):
    matrix = list()
    for i in range(row):
        matrix.append(list())
        for j in range(col):
            matrix[i].append(random.randint(0, 1))
    return matrix

def main():
    matrix = genMatrix(1000, 1000)
    print("generated matrix")
    MAX_PROCESSES = 4
    finalSum = 0

    processPool = Pool(processes=MAX_PROCESSES)
    poolData = list()

    start = time.time()
    for i in range(100):
        for rowNum in range(len(matrix)):
            matrixData = [matrix, rowNum]
            poolData.append(matrixData)

        finalData = processPool.map(addMatrixRow, poolData)
        poolData = list()
        finalSum += sum(finalData)
    end = time.time()
    print(end-start)
    print(f'final sum {finalSum}')


if __name__ == '__main__':
    main()

Your matrix has 1,000 rows of 1,000 elements each, and you are summing each row 100 times. By my calculation, that is 100,000 tasks you are submitting to the pool, passing a one-million-element matrix each time. Ouch!
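To get a feel for that cost, here is a rough check (my own illustration, not part of the original code) of how large the pickled payload is that each of those 100,000 tasks ships to a worker:

import pickle
import random

matrix = [[random.randint(0, 1) for _ in range(1000)] for _ in range(1000)]
# every task in the original code pickles the whole matrix plus a row index
print(len(pickle.dumps(matrix)))  # on the order of 2 MB per task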

Now I know you say that the worker function addMatrixRow must have access to the complete matrix. Fine. But instead of passing it 100,000 times, you can reduce that to 4 times (once per pool process) by initializing each process in the pool with a global variable set to the matrix, using the initializer and initargs arguments when you construct the pool. You are able to get away with this because the matrix is read-only.

And instead of creating poolData as a large list, you can instead use a generator expression that, when iterated, yields the next argument to be submitted to the pool. But to take advantage of this you cannot use the map method, which will convert the generator to a list and not save you any memory. Instead use imap_unordered (rather than imap, since you do not care in what order your worker function returns its results, thanks to the commutative law of addition). But with such a large input, you should be using the chunksize argument with imap_unordered, so that the number of reads and writes to the pool's task queue is greatly reduced (albeit the size of the data being written is larger for each queue operation).
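To make that concrete, a quick back-of-the-envelope calculation (my own, using the same heuristic as the compute_chunksize helper in the code below): 100 passes over a 1,000-row matrix is 100,000 tasks, so with 4 workers each queue operation moves thousands of row indices at once:

# heuristic: chunksize ~= iterable_size / (4 * pool_size), rounded up
chunksize, remainder = divmod(100 * 1000, 4 * 4)
print(chunksize, remainder)  # 6250 0 -> one queue read hands a worker 6,250 tasks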

If all of this is somewhat vague to you, I suggest reading the docs thoroughly for the class multiprocessing.pool.Pool and its imap and imap_unordered methods.

I have made a few other optimizations, replacing for loops with list comprehensions and using the built-in sum function.

import random
from multiprocessing import Pool
import time


def init_pool_processes(m):
    # Runs once in each pool process: store the matrix in a global
    # so it is pickled to each worker only once, not once per task.
    global matrix
    matrix = m

def addMatrixRow(rowNum):
    # The worker only receives a row index; the matrix itself is the
    # global set up by init_pool_processes.
    return sum(matrix[rowNum])

def genMatrix(row, col):
    return [[random.randint(0, 1) for _ in range(col)] for _ in range(row)]

def compute_chunksize(pool_size, iterable_size):
    # Split the work into roughly 4 batches per worker, rounding up.
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

def main():
    matrix = genMatrix(1000, 1000)
    print("generated matrix")
    MAX_PROCESSES = 4

    processPool = Pool(processes=MAX_PROCESSES, initializer=init_pool_processes, initargs=(matrix,))
    start = time.time()
    # Use a generator expression:
    poolData = (rowNum for _ in range(100) for rowNum in range(len(matrix)))
    # Compute efficient chunksize
    chunksize = compute_chunksize(MAX_PROCESSES, len(matrix) * 100)
    finalSum = sum(processPool.imap_unordered(addMatrixRow, poolData, chunksize=chunksize))
    end = time.time()
    print(end-start)
    print(f'final sum {finalSum}')
    processPool.close()
    processPool.join()


if __name__ == '__main__':
    main()

Prints:

generated matrix
0.35799622535705566
final sum 49945400

Note the running time of 0.36 seconds.

Assuming you have more than 4 CPU cores, use them all for an even greater reduction in time.
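For example, one way to size the pool to the machine (my sketch, assuming the same init_pool_processes and matrix as in the code above):

from multiprocessing import Pool, cpu_count

# init_pool_processes and matrix as defined in the code above
processPool = Pool(processes=cpu_count(),
                   initializer=init_pool_processes,
                   initargs=(matrix,))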

You are serializing the entire matrix on each function call; you should only send the data that you are processing to the function, nothing more. And Python has a built-in sum function backed by very optimized C code.

import random
from multiprocessing import Pool
import time


def addMatrixRow(row_data):
    rowSum = sum(row_data)
    return rowSum


def genMatrix(row, col):
    matrix = list()
    for i in range(row):
        matrix.append(list())
        for j in range(col):
            matrix[i].append(random.randint(0, 1))
    return matrix

def main():
    matrix = genMatrix(1000, 1000)
    print("generated matrix")
    MAX_PROCESSES = 4
    finalSum = 0

    processPool = Pool(processes=MAX_PROCESSES)
    poolData = list()

    start = time.time()
    for i in range(100):
        for rowNum in range(len(matrix)):
            matrixData = matrix[rowNum]  # pass only the one row, not the whole matrix
            poolData.append(matrixData)

        finalData = processPool.map(addMatrixRow, poolData)
        poolData = list()
        finalSum += sum(finalData)
    end = time.time()
    print(end-start)
    print(f'final sum {finalSum}')


if __name__ == '__main__':
    main()

Prints:

generated matrix
3.5028157234191895
final sum 49963400

Just not using a process pool and running the code serially using list(map(sum, poolData)):

generated matrix
1.2143816947937012
final sum 50020800

So yeah, Python can do it in about a second.
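For completeness, a sketch of that serial variant (my reconstruction of the one-liner above, reusing genMatrix from the answer's code):

import random
import time

def genMatrix(row, col):
    return [[random.randint(0, 1) for _ in range(col)] for _ in range(row)]

matrix = genMatrix(1000, 1000)
# same 100 passes over every row as the multiprocessing version
poolData = [matrix[rowNum] for _ in range(100) for rowNum in range(len(matrix))]

start = time.time()
finalData = list(map(sum, poolData))  # no pool, no pickling overhead
finalSum = sum(finalData)
print(time.time() - start)
print(f'final sum {finalSum}')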
