简体   繁体   中英

Slow Row-wise Comparison with For-loops in NumPy - How to improve?

I'm using python and numpy to compare two arrays or equal shape with coordinates (x,y,z) in order to match them, which look like that:

coordsCFS
array([[ 0.02      ,  0.02      ,  0.        ],
       [ 0.03      ,  0.02      ,  0.        ],
       [ 0.02      ,  0.025     ,  0.        ],
        ..., 
       [ 0.02958333,  0.029375  ,  0.        ],
       [ 0.02958333,  0.0290625 ,  0.        ],
       [ 0.02958333,  0.0296875 ,  0.        ]])

and

coordsRMED
array([[ 0.02      ,  0.02      ,  0.        ],
       [ 0.02083333,  0.02      ,  0.        ],
       [ 0.02083333,  0.020625  ,  0.        ],
       ..., 
       [ 0.03      ,  0.0296875 ,  0.        ],
       [ 0.02958333,  0.03      ,  0.        ],
       [ 0.02958333,  0.0296875 ,  0.        ]]) 

The data are read from two hdf5 files with h5py. For the comparison I use allclose , which tests for "almost equality". The coordinates do not match within python's regular floating point precision. This is the reason I used the for loops, otherwise it would have worked with numpy.where . I usually try to avoid for loops, but in this context I couldn't figure out how. So I came up with this surprisingly slow snippet:

mapList = []
for cfsXYZ in coordsCFS:
    # print cfsXYZ
    indexMatch = 0
    match = []
    for asterXYZ in coordRMED:
        if numpy.allclose(asterXYZ,cfsXYZ):
            match.append(indexMatch)
            # print "Found match at index " + str(indexMatch)
            # print asterXYZ
        indexMatch += 1

    # check: must only find one match. 
    if len(match) != 1:
        print "ERROR matching"
        print match
        print cfsXYZ
        return 1

    # save to list
    mapList.append(match[0])

if len(mapList) != coordsRMED.shape[0]:
    print "ERROR: matching consistency check"
    print mapList
    return 1

This is very slow for my test sample size (800 rows). I plan to compare much larger sets. I could remove the consistency check and use break in the inner for loop for some speed benefit. Is there still a better way?

One solution is to sort both arrays (adding an index column so that the sorted arrays still contains the original indices). Then, to match, step through the arrays in lock-step. Since you're expecting a precise 1-1 correspondence, you should always be able to match pairs of rows off.

A first thing to remember is that by default, in NumPy, "the iteration always proceeds in a C-style contiguous fashion (last index varying the fastest)"[1]. You might improve things by reversing the order of iteration (iterate on coordMED.T , the transpose of coordMED ...)

Nevertheless, I'm still surprised by you need for a loop: you claim that 'The coordinates do not match within python's regular floating point precision': have you tried to adjust the rtol and atol parameters of np.allclose , as described in its doc ?

[1]

You could get rid of the inner loop with something like this:

for cfsXYZ in coordsCFS:
    match = numpy.nonzero(
        numpy.max(numpy.abs(coordRMED - cfsXYZ), axis=1) < TOLERANCE)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM