Remove/retain numpy array rows based on ID value

Question

I have two numpy arrays, each with an identifying number in column 0.

Where the identifying numbers for each array match, I wish to keep the corresponding row associated with those ID numbers.

Where an ID has no matching ID in the other array, I wish to delete the row associated with that ID, only from the array in which that ID occurs.

The arrays are both ordered by their ID numbers.

Examples of the input arrays a and b, and the output arrays c and d, can be found below. Note that the arrays do not have the same number of rows (NB: the real a and b are much larger, with shapes (2487, 12) and (2482, 12) respectively).

In:

a =
[[9.60977,  97.5,  96,    99,    100.5,  1.60]
 [9.60978,  97.5,  96,    100.5, 102,    0.31]
 [9.60979,  97.5,  96,    102,   103.5,  0.11]
 [9.60980,  97.5,  96,    103.5, 105,    0.05]
 [9.60981,  97.5,  96,    105,   106.5,  0.03]
 [9.60983,  97.5,  96,    108,   109.5,  0.01]
 [9.60984,  97.5,  96,    109.5, 111,    0.01]]

b = 
[[9.60977,  99,    100.5, 97.5,  96,     1.58]
 [9.60979,  102,   103.5, 97.5,  96,     0.11]
 [9.60980,  103.5, 105,   97.5,  96,     0.05] 
 [9.60981,  105,   106.5, 97.5,  96,     0.03]
 [9.60982,  106.5, 108,   97.5,  96,     0.02]
 [9.60984,  109.5, 111,   97.5,  96,     0.01]]

Out:

c =
[[9.60977,  97.5,  96,    99,    100.5,  1.60]
 [9.60979,  97.5,  96,    102,   103.5,  0.11]
 [9.60980,  97.5,  96,    103.5, 105,    0.05]
 [9.60981,  97.5,  96,    105,   106.5,  0.03]
 [9.60984,  97.5,  96,    109.5, 111,    0.01]]

d = 
[[9.60977,  99,    100.5, 97.5,  96,     1.58]
 [9.60979,  102,   103.5, 97.5,  96,     0.11]
 [9.60980,  103.5, 105,   97.5,  96,     0.05] 
 [9.60981,  105,   106.5, 97.5,  96,     0.03]
 [9.60984,  109.5, 111,   97.5,  96,     0.01]]

I have tried using a pair of if statements inside a for loop, but this fails because 1) the arrays aren't the same length (see the traceback below), and 2) it doesn't retest a row once a value has been deleted:

for i in np.arange(0, max(len(a), len(b)), 1):
    if a[i, 0] > b[i, 0]:
        a = np.delete(a, i, 0)
    if a[i, 0] < b[i, 0]:
        b = np.delete(b, i, 0)

Traceback (most recent call last):

  File "<ipython-input-271-509fc93aea3b>", line 2, in <module>
    if a[i, 0] > b[i, 0]:

IndexError: index 4 is out of bounds for axis 0 with size 3

I've also tried this while loop, but it deletes the wrong rows in array b:

n = 0
s = max(len(a), len(b))
c = np.array(())
d = np.array(())
while n != s:
    if a[n, 0] == b[n, 0]:
        c = np.append(c, a[n, :])
        d = np.append(d, b[n, :])
        n = n+1
    elif a[n, 0] > b[n, 0]:
        a = np.delete(a, n, 0)
    elif a[n, 0] < b[n, 0]:
        b = np.delete(b, n, 0)

Traceback (most recent call last):

  File "<ipython-input-285-f7c600c498cb>", line 6, in <module>
    if a[n, 0] == b[n, 0]:

IndexError: index 1 is out of bounds for axis 0 with size 1

Are there any more sensible ways that I can remove and append rows using the ID numbers?

Answer 1

You can use np.isin to find which values in the first column of each array also occur in the first column of the other array. Then it's just a matter of boolean indexing.

c = a[np.isin(a[:,0],b[:,0])]

d = b[np.isin(b[:,0],a[:,0])]

>>> c
array([[  9.60977000e+00,   9.75000000e+01,   9.60000000e+01,
          9.90000000e+01,   1.00500000e+02,   1.60000000e+00],
       [  9.60979000e+00,   9.75000000e+01,   9.60000000e+01,
          1.02000000e+02,   1.03500000e+02,   1.10000000e-01],
       [  9.60980000e+00,   9.75000000e+01,   9.60000000e+01,
          1.03500000e+02,   1.05000000e+02,   5.00000000e-02],
       [  9.60981000e+00,   9.75000000e+01,   9.60000000e+01,
          1.05000000e+02,   1.06500000e+02,   3.00000000e-02],
       [  9.60984000e+00,   9.75000000e+01,   9.60000000e+01,
          1.09500000e+02,   1.11000000e+02,   1.00000000e-02]])
>>> d
array([[  9.60977000e+00,   9.90000000e+01,   1.00500000e+02,
          9.75000000e+01,   9.60000000e+01,   1.58000000e+00],
       [  9.60979000e+00,   1.02000000e+02,   1.03500000e+02,
          9.75000000e+01,   9.60000000e+01,   1.10000000e-01],
       [  9.60980000e+00,   1.03500000e+02,   1.05000000e+02,
          9.75000000e+01,   9.60000000e+01,   5.00000000e-02],
       [  9.60981000e+00,   1.05000000e+02,   1.06500000e+02,
          9.75000000e+01,   9.60000000e+01,   3.00000000e-02],
       [  9.60984000e+00,   1.09500000e+02,   1.11000000e+02,
          9.75000000e+01,   9.60000000e+01,   1.00000000e-02]])
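
For anyone who wants to reproduce the result above, here is a minimal self-contained sketch using the sample arrays from the question. The names keep_a and keep_b are only illustrative. It assumes the ID values are exactly equal as floats in both arrays (for example, read from the same source); np.isin compares exactly, so IDs that differ by rounding error would not match. np.isin itself requires NumPy 1.13 or later.

```python
import numpy as np

# Sample data copied from the question; column 0 holds the ID.
a = np.array([
    [9.60977, 97.5, 96, 99,    100.5, 1.60],
    [9.60978, 97.5, 96, 100.5, 102,   0.31],
    [9.60979, 97.5, 96, 102,   103.5, 0.11],
    [9.60980, 97.5, 96, 103.5, 105,   0.05],
    [9.60981, 97.5, 96, 105,   106.5, 0.03],
    [9.60983, 97.5, 96, 108,   109.5, 0.01],
    [9.60984, 97.5, 96, 109.5, 111,   0.01],
])
b = np.array([
    [9.60977, 99,    100.5, 97.5, 96, 1.58],
    [9.60979, 102,   103.5, 97.5, 96, 0.11],
    [9.60980, 103.5, 105,   97.5, 96, 0.05],
    [9.60981, 105,   106.5, 97.5, 96, 0.03],
    [9.60982, 106.5, 108,   97.5, 96, 0.02],
    [9.60984, 109.5, 111,   97.5, 96, 0.01],
])

# Boolean masks: True where a row's ID also appears in the other array.
keep_a = np.isin(a[:, 0], b[:, 0])
keep_b = np.isin(b[:, 0], a[:, 0])

# Boolean indexing keeps only the rows whose mask entry is True.
c = a[keep_a]
d = b[keep_b]

print(c.shape, d.shape)  # (5, 6) (5, 6)
```

Because both masks are built from the original a and b, neither array is modified while the other is being tested, which avoids the retesting problem described in the question.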

Explanation:

>>> np.isin(a[:,0],b[:,0])
array([ True, False,  True,  True,  True, False,  True], dtype=bool)

The above shows where the values in the first column of a can be found in the first column of b. You can then index a with that array of booleans, using the code I showed above:

c = a[np.isin(a[:,0],b[:,0])]
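
If you would rather keep the loop-based approach from the question, the following is a rough sketch of a two-pointer walk. It is not part of the original answer, and the function name match_sorted_rows is made up for illustration. It assumes both arrays are sorted by ID (as stated in the question) and that IDs are unique within each array. Instead of deleting rows while iterating, it uses a separate index for each array and records which rows to keep, so neither index can run past the end of its array.

```python
import numpy as np

def match_sorted_rows(a, b):
    """Keep rows whose column-0 ID appears in both arrays.

    Assumes a and b are sorted by column 0 and IDs are unique per array.
    """
    i = j = 0                 # independent row indices for a and b
    keep_a, keep_b = [], []
    while i < len(a) and j < len(b):
        if a[i, 0] == b[j, 0]:
            # IDs match: keep both rows and advance both indices.
            keep_a.append(i)
            keep_b.append(j)
            i += 1
            j += 1
        elif a[i, 0] < b[j, 0]:
            i += 1            # a's ID has no partner in b; skip it
        else:
            j += 1            # b's ID has no partner in a; skip it
    # An explicit integer dtype keeps the indexing valid even if nothing matched.
    return a[np.array(keep_a, dtype=int)], b[np.array(keep_b, dtype=int)]

# Small made-up example (not the data from the question).
a = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [5.0, 50.0]])
b = np.array([[1.0, 0.1], [3.0, 0.3], [4.0, 0.4]])
c, d = match_sorted_rows(a, b)
print(c[:, 0], d[:, 0])  # [1. 4.] [1. 4.]
```

For arrays of a few thousand rows the np.isin version is shorter and likely faster, since the loop here runs in pure Python.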

solution1
2 ACCEPTED 2018-07-25 16:33:41