Find the set difference between two large arrays (matrices) in Python

Question

I have two large 2-d arrays and I'd like to find their set difference taking their rows as elements. In Matlab, the code for this would be setdiff(A,B,'rows') . The arrays are large enough that the obvious looping methods I could think of take too long.

Answer 1

This should work, but is currently broken in 1.6.1 due to an unavailable mergesort for the view being created. It works in the pre-release 1.7.0 version. This should be the fastest way possible, since the views don't have to copy any memory:

>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
>>> a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])
>>> np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
array([[1, 2, 3]])

You can do this in Python, but it might be slow:

>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = set(map(tuple, a1))
>>> a2_rows = set(map(tuple, a2))
>>> a1_rows.difference(a2_rows)
set([(1, 2, 3)])

Answer 2

Here is a nice alternative pure numpy solution that works for 1.6.1. It does create an intermediate array, so this may or may not be a problem for you. It also does not rely on any speedup from a sorted array or not (as setdiff probably does).

from numpy import *
# Create some sample arrays
A =random.randint(0,5,(10,3))
B =random.randint(0,5,(10,3))

As an example, this is what I got - note that there is one common element:

>>> A
array([[1, 0, 3],
       [0, 4, 2],
       [0, 3, 4],
       [4, 4, 2],
       [2, 0, 2],
       [4, 0, 0],
       [3, 2, 2],
       [4, 2, 3],
       [0, 2, 1],
       [2, 0, 2]])
>>> B
array([[4, 1, 3],
       [4, 3, 0],
       [0, 3, 3],
       [3, 0, 3],
       [3, 4, 0],
       [3, 2, 3],
       [3, 1, 2],
       [4, 1, 2],
       [0, 4, 2],
       [0, 0, 3]])

We look for when the (L1) distance between the rows is zero. This gives us a matrix, which at the points where it is zero, these are the items common to both lists:

idx = where(abs((A[:,newaxis,:] - B)).sum(axis=2)==0)

As a check:

>>> A[idx[0]]
array([[0, 4, 2]])
>>> B[idx[1]]
array([[0, 4, 2]])

Answer 3

I'm not sure what you are going for, but this will get you a boolean array of where 2 arrays are not equal, and will be numpy fast:


import numpy as np
a = np.random.randn(5, 5)
b = np.random.randn(5, 5)
a[0,0] = 10.0
b[0,0] = 10.0 
a[1,1] = 5.0
b[1,1] = 5.0
c = ~(a-b==0)
print c

[[False True True True True] [ True False True True True] [ True True True True True] [ True True True True True] [ True True True True True]]

Find the set difference between two large arrays (matrices) in Python

Question

3 answers

solution1
10 2012-08-10 14:05:48

solution2
5 2012-08-10 14:46:49

solution3
-1 2012-08-10 14:28:23

Find the set difference between two large arrays (matrices) in Python

Question

3 answers

solution1 10 2012-08-10 14:05:48

solution2 5 2012-08-10 14:46:49

solution3 -1 2012-08-10 14:28:23

solution1
10 2012-08-10 14:05:48

solution2
5 2012-08-10 14:46:49

solution3
-1 2012-08-10 14:28:23