
Fast way to find matching elements in multiple numpy arrays

I am trying to find matching elements across a large number of large numpy arrays. The way I am doing this currently (holding the arrays in a dictionary and looping over them several times) is very slow, and I am looking for a faster way to tackle the problem.

Here is a self-contained section of code that presents the problem:

import numpy as np

data = {}
for frame in xrange(100):
    data[frame] = np.random.randint(100, size=(100, 3))
# data structure: 100 'pages', each an array of 100 points (rows of 3 ints)
# goal: find points that appear on more than one page

for page in data:
    for point in data[page]:
        for page_to_match in data:
            if page_to_match == page:  # skip matches on the same page
                continue
            for point_to_match in data[page_to_match]:
                if np.array_equal(point, point_to_match):
                    print 'Found a match -', point, '- pages:', page, page_to_match
                    # should find about 0 - 3 matches per page

As you can see it works, but very slowly: every point is compared against every point on every other page.

Edit: here is a minimal version of the code. It runs quickly on small arrays like this but slowly on large arrays like the ones above (the only cross-page match here is the row [3, 4], shared by pages 0 and 2). Replace the data-setup lines at the top of the section above with the following:

data = {}
data[0] = np.array([[1,2],[3,4]])
data[1] = np.array([[5,6],[7,8]])
data[2] = np.array([[3,4],[5,8]])

If memory isn't an issue, I would solve the problem like this:

>>> data = {}
>>> for frame in range(100):
...     data[frame] = np.random.randint(100, size=(100, 3))
...
>>> from collections import defaultdict
>>> grouper = defaultdict(list)
>>> for k, v in data.items():
...     for row in v:
...         grouper[tuple(row)].append(k)
...
>>> for k, v in grouper.items():
...     if len(v) > 1:
...         print("Duplicate row", np.array(k), "found in pages:")
...         print(*v)
...

Results:

Duplicate row [89 50 83] found in pages:
13 96
Duplicate row [76 88 29] found in pages:
32 56
Duplicate row [74 26 81] found in pages:
11 99
Duplicate row [19  4 53] found in pages:
17 44
Duplicate row [56 20  4] found in pages:
41 91
Duplicate row [92 62 50] found in pages:
6 45
Duplicate row [86  9 39] found in pages:
12 62
Duplicate row [64 47 84] found in pages:
5 51
Duplicate row [52 74 44] found in pages:
9 19
Duplicate row [60 21 39] found in pages:
14 47
Duplicate row [80 42 33] found in pages:
65 82
Duplicate row [ 4 63 85] found in pages:
8 24
Duplicate row [70 84 35] found in pages:
42 69
Duplicate row [54 14 31] found in pages:
43 47
Duplicate row [38 81 38] found in pages:
51 67
Duplicate row [55 59 10] found in pages:
29 54
Duplicate row [84 77 37] found in pages:
51 53
Duplicate row [76 27 54] found in pages:
33 39
Duplicate row [52 64 20] found in pages:
1 37
Duplicate row [65 97 45] found in pages:
61 80
Duplicate row [69 52  8] found in pages:
60 85
Duplicate row [51  2 37] found in pages:
1 52
Duplicate row [31 36 53] found in pages:
50 84
Duplicate row [24 57  1] found in pages:
34 88
Duplicate row [87 89  0] found in pages:
11 78
Duplicate row [94 38 17] found in pages:
40 89
Duplicate row [46 25 46] found in pages:
54 87
Duplicate row [34 15 14] found in pages:
11 92
Duplicate row [ 3 68  1] found in pages:
48 78
Duplicate row [ 9 17 21] found in pages:
21 62
Duplicate row [17 73 66] found in pages:
1 60
Duplicate row [42 15 24] found in pages:
39 78
Duplicate row [90  8 95] found in pages:
41 61
Duplicate row [19  0 51] found in pages:
30 43
Duplicate row [65 99 51] found in pages:
57 81
Duplicate row [ 5 79 61] found in pages:
17 80
Duplicate row [49 74 71] found in pages:
0 57
Duplicate row [77  3  3] found in pages:
18 57
Duplicate row [37 54 66] found in pages:
5 13
Duplicate row [64 64 82] found in pages:
19 23
Duplicate row [64  6 21] found in pages:
27 39
Duplicate row [39 92 82] found in pages:
8 98
Duplicate row [99 10 15] found in pages:
39 68
Duplicate row [45 16 57] found in pages:
8 53
Duplicate row [12 62 98] found in pages:
29 96
Duplicate row [78 73 56] found in pages:
3 79
Duplicate row [62 87 18] found in pages:
84 97
Duplicate row [46 72 87] found in pages:
5 10
Duplicate row [27 75 25] found in pages:
50 57
Duplicate row [92 62 38] found in pages:
0 87
Duplicate row [27 95 86] found in pages:
15 80

Note: my solution is written in Python 3. If you are on Python 2, use d.iteritems() instead of d.items() .
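
On Python 2 the grouping loop above would therefore read:

for k, v in data.iteritems():
    for row in v:
        grouper[tuple(row)].append(k)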

There might be a better solution using numpy , but this one is quick to write and runs in roughly linear rather than quadratic time.
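
For what it's worth, here is a sketch of what such a numpy approach could look like: stack all pages into one array and let np.unique with axis=0 (available in NumPy 1.13+) group identical rows. This assumes, as in the question, that every page has the same number of rows; it is one possible vectorized variant, not the only way:

import numpy as np

data = {frame: np.random.randint(100, size=(100, 3)) for frame in range(100)}

# stack every page into one (100 * 100, 3) array, remembering each row's page
pages = np.repeat(np.arange(len(data)), 100)
stacked = np.concatenate([data[k] for k in sorted(data)])

# np.unique with axis=0 groups identical rows; 'inverse' maps each stacked
# row back to its group, and 'counts' gives each group's size
uniq, inverse, counts = np.unique(stacked, axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # guard against shape differences across numpy versions

for group in np.flatnonzero(counts > 1):
    page_set = sorted(set(pages[inverse == group]))
    if len(page_set) > 1:  # only report rows shared across different pages
        print("Duplicate row", uniq[group], "found in pages:", *page_set)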

Assuming you want to accelerate your code, it seems very well suited for Cython . Python's for loop has some overhead, and depending on what you are comparing you may not need rich comparisons. This gives great potential for acceleration. In Cython you can compile your slightly modified Python code, so your loops run as fast as native C loops. Here is a little example:

# typed C declarations let Cython compile this into a plain C loop
cdef int i, I
cdef int x[1000]
i, I = 0, 1000
for i in range(1000):
    x[i] = i

When compiled, Cython first generates C code from the Cython source; the result then runs greatly accelerated in comparison to pure Python.
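
As a minimal sketch of that build step (assuming the snippet above is saved as example.pyx and Cython is installed), a standard setup.py would look like this:

from setuptools import setup
from Cython.Build import cythonize

# translates example.pyx to C and compiles it into an importable extension
setup(ext_modules=cythonize("example.pyx"))

Running python setup.py build_ext --inplace then produces the compiled module next to the source, ready to import.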
