
Vectorizing nested for loops in list comprehension

I have two lists of strings for which I'm calculating the Damerau–Levenshtein distance to check which are similar. The issue is that both lists are over 200k entries, so the comprehension takes quite a lot of time. For the distance computation I'm using the pyxDamerauLevenshtein package, which is written in Cython, so the distance function itself should not be the bottleneck.

series = ([damerau_levenshtein_distance(i, j) for i in original_string for j in compare_string])

That's how my code looks, and I wonder whether it can be vectorized somehow to boost performance, or whether there is some other way to speed up the computation.

What my dataset is:

Original string: a pd.Series of unique street names

Compare string: a pd.Series of manually entered street names that I want to compare against to find similar ones

The output should look like this:

   Original    Compare   Distance
0  Street1     Street1      1
1  Street2     Street1      2
2  Street3     Street1      3
3  Street4     Street3      5
4  Street5     Street3      5
5  Street6     Street6      1
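A minimal sketch of how such a result frame could be assembled: compute every pairwise distance, then keep the closest match per original name. The `distance` function here is a crude stand-in for `damerau_levenshtein_distance` (the real Cython function would be dropped in instead), and the sample data is made up for illustration.

```python
import pandas as pd

# Stand-in for damerau_levenshtein_distance; swap in the real Cython function.
def distance(a, b):
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))

original_string = pd.Series(["Street1", "Street2", "Street3"])
compare_string = pd.Series(["Street1", "Street3"])

# Cross product: one distance per (original, compare) pair.
rows = [(i, j, distance(i, j))
        for i in original_string for j in compare_string]
all_pairs = pd.DataFrame(rows, columns=["Original", "Compare", "Distance"])

# Keep, for each original street, the closest manually entered name.
best = all_pairs.loc[all_pairs.groupby("Original")["Distance"].idxmin()]
print(best.reset_index(drop=True))
```

The cross product is still O(n·m) distance calls, so this only organises the output; the multiprocessing answer below is what actually spreads that work across cores.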

If you can think of a way to use map (or imap) functions rather than nested loops, you could then try using multiprocessing to fully utilise your CPU. For example, in this case:

pool.map(lambda j: map(lambda i: damerau_levenshtein_distance(i, j),original_string),compare_string) 

where 'pool.map' is the multiprocessing map, and the second 'map' is regular.
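One caveat: with the default process start method, lambdas cannot be pickled, so `pool.map` with a lambda as written above would fail. A module-level function plus `functools.partial` is a common workaround. A minimal sketch, with a trivial stand-in distance function and made-up street names:

```python
import multiprocessing as mp
from functools import partial

# Stand-in distance; must be defined at module level so it can be pickled.
def distance(a, b):
    return abs(len(a) - len(b))

def distances_to(compare, originals):
    # One row of the result: one compare street against every original name.
    return [distance(i, compare) for i in originals]

if __name__ == "__main__":
    originals = ["Street1", "LongStreetName", "St"]
    compares = ["Street9", "Ave"]
    with mp.Pool() as pool:
        # partial pins the originals argument; pool.map supplies each compare.
        result = pool.map(partial(distances_to, originals=originals), compares)
    print(result)
```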

Below is a quick, but functional example of multiprocessing which could cover what you are looking for. I structured it a bit differently to avoid some pickling problems and to get it to compute in the background somewhat asynchronously since your lists are long...
(This can definitely be improved, but should hopefully serve as a proof-of-concept for your example)

import multiprocessing as mp
import itertools

list1 = range(5)
list2 = range(5)

def doSomething(a, b):
    return a + b  # your damerau_levenshtein_distance function goes here

def mapDoSomething(args):
    i, otherlist = args  # an element of list2, and a full copy of list1
    return [doSomething(i, j) for j in otherlist]

if __name__ == '__main__':
    pool = mp.Pool()
    answer = pool.imap(mapDoSomething, zip(list2, itertools.repeat(list1)))
    # imap computes the results in the background while the rest of the code
    # runs. You can iterate over `answer` or call next(answer), and it only
    # blocks if a result hasn't been computed yet. Converting to a list here
    # forces all results to finish before printing; this is only to show it
    # worked. For larger lists, don't do this.
    print(list(answer))
    pool.close()
    pool.join()

This code produces:

[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7], [4, 5, 6, 7, 8]]

which is each element of list1 combined (here, added) with each element of list2, which I think is what you attempted to do in your code with your lists of strings.

The code sets up the process pool, then uses imap to split the work on list2 across multiple processes. The zip function lazily pairs each element of list2 with a full copy of list1, since imap only supports functions that take a single argument. Each pair is then unpacked in mapDoSomething, which runs the doSomething function on every element of list1 with that element of list2.

Since I've used imap, each sub-list becomes available as soon as it is computed, rather than only after the entire result has finished.
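To actually consume results incrementally, you would iterate over the imap iterator (or call next() on it) instead of wrapping it in list(). A minimal sketch with a stand-in worker function:

```python
import multiprocessing as mp

# Stand-in worker; replace with the real per-element computation.
def square(x):
    return x * x

if __name__ == "__main__":
    with mp.Pool() as pool:
        # imap yields results in input order, each as soon as it is ready;
        # chunksize batches work to reduce inter-process overhead.
        for value in pool.imap(square, range(5), chunksize=2):
            print(value)  # each result printed without waiting for the rest
```

For very long inputs, tuning `chunksize` upward usually matters more than the choice between map and imap, since it amortises the pickling cost per task.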
