Slow Row-wise Comparison with For-loops in NumPy - How to improve?

Question

I'm using python and numpy to compare two arrays or equal shape with coordinates (x,y,z) in order to match them, which look like that:

coordsCFS
array([[ 0.02      ,  0.02      ,  0.        ],
       [ 0.03      ,  0.02      ,  0.        ],
       [ 0.02      ,  0.025     ,  0.        ],
        ..., 
       [ 0.02958333,  0.029375  ,  0.        ],
       [ 0.02958333,  0.0290625 ,  0.        ],
       [ 0.02958333,  0.0296875 ,  0.        ]])

and

coordsRMED
array([[ 0.02      ,  0.02      ,  0.        ],
       [ 0.02083333,  0.02      ,  0.        ],
       [ 0.02083333,  0.020625  ,  0.        ],
       ..., 
       [ 0.03      ,  0.0296875 ,  0.        ],
       [ 0.02958333,  0.03      ,  0.        ],
       [ 0.02958333,  0.0296875 ,  0.        ]])

The data are read from two hdf5 files with h5py. For the comparison I use allclose , which tests for "almost equality". The coordinates do not match within python's regular floating point precision. This is the reason I used the for loops, otherwise it would have worked with numpy.where . I usually try to avoid for loops, but in this context I couldn't figure out how. So I came up with this surprisingly slow snippet:

mapList = []
for cfsXYZ in coordsCFS:
    # print cfsXYZ
    indexMatch = 0
    match = []
    for asterXYZ in coordRMED:
        if numpy.allclose(asterXYZ,cfsXYZ):
            match.append(indexMatch)
            # print "Found match at index " + str(indexMatch)
            # print asterXYZ
        indexMatch += 1

    # check: must only find one match. 
    if len(match) != 1:
        print "ERROR matching"
        print match
        print cfsXYZ
        return 1

    # save to list
    mapList.append(match[0])

if len(mapList) != coordsRMED.shape[0]:
    print "ERROR: matching consistency check"
    print mapList
    return 1

This is very slow for my test sample size (800 rows). I plan to compare much larger sets. I could remove the consistency check and use break in the inner for loop for some speed benefit. Is there still a better way?

Answer 1

One solution is to sort both arrays (adding an index column so that the sorted arrays still contains the original indices). Then, to match, step through the arrays in lock-step. Since you're expecting a precise 1-1 correspondence, you should always be able to match pairs of rows off.

Answer 2

A first thing to remember is that by default, in NumPy, "the iteration always proceeds in a C-style contiguous fashion (last index varying the fastest)"[1]. You might improve things by reversing the order of iteration (iterate on coordMED.T , the transpose of coordMED ...)

Nevertheless, I'm still surprised by you need for a loop: you claim that 'The coordinates do not match within python's regular floating point precision': have you tried to adjust the rtol and atol parameters of np.allclose , as described in its doc ?

[1]

Answer 3

You could get rid of the inner loop with something like this:

for cfsXYZ in coordsCFS:
    match = numpy.nonzero(
        numpy.max(numpy.abs(coordRMED - cfsXYZ), axis=1) < TOLERANCE)

Slow Row-wise Comparison with For-loops in NumPy - How to improve?

Question

3 answers

solution1
1

solution2
1 2012-10-05 08:15:34

solution3
1 ACCPTED 2012-10-05 08:39:35

Slow Row-wise Comparison with For-loops in NumPy - How to improve?

Question

3 answers

solution1 1

solution2 1 2012-10-05 08:15:34

solution3 1 ACCPTED 2012-10-05 08:39:35

solution1
1

solution2
1 2012-10-05 08:15:34

solution3
1 ACCPTED 2012-10-05 08:39:35