
How to parallelize over list elements in Python

I have a big list of dictionaries. From this list I would like to compute a two-dimensional matrix, where each element is a function of the i-th and the j-th elements of the list. The code is something like this:

    biglist = self.some_method()
    matrix = np.zeros((len(biglist), len(biglist)))

    for i in range(len(biglist)):
        for j in range(i, len(biglist)):
            matrix[i, j] = self.__computeScore(biglist[i], biglist[j], i, j)[2]
            matrix[j, i] = matrix[i, j]  # the score is symmetric

and the __computeScore method is, for now, something very simple:

    def __computeScore(self, dictionary1, dictionary2, i, j):
        # value = eventually some computation over dictionary1 and dictionary2
        value = 1  # placeholder for now
        return (i, j, value)

Even though the method that computes the score is very simple for now, computing the matrix takes a while. I would like to parallelize the matrix computation. What's the best way to proceed? So far I have tried apply_async and Pool.map from the multiprocessing module, but the computation took even more time than the original code. I tried something like this:

    pool = multiprocessing.Pool(processes=4)
    params = []
    print("argument creation")  # this takes a while, so it's not convenient
    for i in range(len(biglist)):
        for j in range(i, len(biglist)):
            params.append((biglist[i], biglist[j], i, j))

    # note: Pool.map passes each tuple as a single argument, so __computeScore
    # would need to unpack it (or Pool.starmap would be needed) for this to run
    result = pool.map(self.__computeScore, params)  # takes more time than the original code

and I also tried something like:

    def my_callback(result):
        matrix[result[0], result[1]] = result[2]

    pool = multiprocessing.Pool(processes=4)
    for i in range(len(biglist)):
        rcopy = dict(biglist[i])
        for j in range(i, len(biglist)):
            ccopy = dict(biglist[j])
            pool.apply_async(self.__computeScore, args=(rcopy, ccopy, i, j), callback=my_callback)
    pool.close()
    pool.join()

but it also takes more time than the original code. Where am I wrong? Thank you!

Where am I wrong?

In the assumption that parallelizing among multiple processes at the level of matrix[i,j] = self.__computeScore(...) will get you a significant performance gain.

Also, in the assumption that a minor modification of your code will lead to a significant speed-up. This is not necessarily the case with "math" problems like yours. These usually require you to restructure your algorithm, involving more efficient data structures.

Why you have been disappointed

In fact, what you have been observing is that spawning processes and communicating tasks to them comes with a tremendous overhead compared to keeping things within one process, or even one thread. This overhead only pays off when communicating the task takes much less time than the computation the task requires. Your multiprocessing-based approach spends most of its time on inter-process communication, and __computeScore(self, dictionary1, dictionary2, i, j) most likely just bores the child process.

A counter-example: say you give a process a file name, and that process then spends 15 seconds reading data from that file and processing it. Transmitting the file name is a matter of microseconds. Hence, in this case, the time it takes to "define" the work is negligible compared to the time it takes to perform the actual work. This is the scenario where multiprocessing shines. It's actually pretty simple, especially once one understands how multiprocessing works under the hood. However, the misconception you have here is a very common one, unfortunately.
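To make that contrast concrete, here is a minimal sketch of the kind of workload where multiprocessing pays off; the process_file function and the file names are made up for illustration. Each task is described by a tiny message (a file name), but the work it triggers takes seconds:

    import multiprocessing

    def process_file(filename):
        # hypothetical coarse-grained task: reading and crunching one file
        # takes seconds, while transmitting the file name takes microseconds
        total = 0
        with open(filename) as f:
            for line in f:
                total += len(line)  # stand-in for the actual expensive processing
        return total

    if __name__ == "__main__":
        filenames = ["data0.txt", "data1.txt", "data2.txt", "data3.txt"]
        with multiprocessing.Pool(processes=4) as pool:
            results = pool.map(process_file, filenames)  # cheap to send, expensive to run

Here the communication overhead is amortized over seconds of real work per task, which is exactly what is missing in your matrix case.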

Your case is entirely different. You need to know how computers work to understand where your application can be optimized. First of all, you should avoid copying massive amounts of data around in RAM, especially between process spaces! The big data you operate on should be stored once, in memory, in an efficient data structure, that is, in a data structure that allows all subsequent operations to be performed efficiently.

Then you need to appreciate that Python itself is not a fast language. The solution is not to switch to a faster high-level language, but to "externalize" performance-critical code to compiled code as much as possible; the doubly nested loops you have are a strong indicator of that. This looping should be performed within a compiled routine. The fact that you already treat your problem as a linear algebra problem loudly calls for a more consistent use of a linear algebra package (you already use numpy, but not consistently). I see that you needed the looping because much of your "base" data lives in lists or dictionaries. The way I see it, you need to find a way to get away from these lists/dicts and define all of your data in terms of numpy data types. That way, you may find a simple way to have the O(N^2) operation performed by machine code instead of by your Python interpreter.
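As an illustration only (it assumes your score can ultimately be expressed through numpy operations, and the feature keys "a", "b", "c" are made up), here is a sketch of what "externalizing" the double loop looks like. The dictionaries are flattened once into a contiguous numeric array, and the whole N x N score matrix then falls out of a single compiled operation:

    import numpy as np

    # hypothetical: each dict in biglist exposes the same numeric fields,
    # so the whole list collapses into one contiguous (N, k) float array
    keys = ["a", "b", "c"]
    X = np.array([[d[key] for key in keys] for d in biglist], dtype=np.float64)

    # if the score were, say, the dot product of two feature vectors, the
    # entire N x N symmetric matrix is a single compiled matrix product,
    # with no Python-level loops at all
    matrix = X @ X.T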

Conclusion -- how you should approach this

You do not need multiple processes. You need the right approach within one process (thread, even). You need:

  1. more efficient data structures
  2. a linear algebra based algorithm for your calculation
  3. highly optimized math libraries for executing this algorithm

If you approach this the right way, I can assure you that your code will run orders of magnitude faster, because numpy/scipy under the hood make use of vectorized CPU features (SIMD, "single instruction, multiple data") which run your data operations extremely fast. And if your matrix is large enough for the algorithm to take advantage of multiple CPU cores in theory: numpy and scipy can be built against multithreaded BLAS/LAPACK backends (with OpenMP support), by which certain matrix-operating routines automatically distribute their work across multiple threads.
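As a rough demonstration of the gap (exact numbers depend on your machine and BLAS build), you can time the question's double loop against the equivalent single matrix product:

    import time
    import numpy as np

    N, k = 1000, 50
    X = np.random.rand(N, k)

    # pure-Python double loop over the upper triangle, as in the question
    start = time.time()
    slow = np.zeros((N, N))
    for i in range(N):
        for j in range(i, N):
            slow[i, j] = slow[j, i] = X[i] @ X[j]
    print("loops:", time.time() - start)

    # one compiled call; with OpenBLAS/MKL it may also use several cores
    start = time.time()
    fast = X @ X.T
    print("BLAS :", time.time() - start)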
