
Removing rows from a multi-dimensional numpy array

I have a rather big 3-dimensional numpy array of shape (2000, 2500, 32) that I need to manipulate. Some rows are bad, so I need to delete several of them. To detect which row is "bad" I use the following function:

import numpy as np

def badDetect(x):
  # x is a single sequence of 2000 values, i.e. a[:, i, j]
  for i in range(10, 19):  # checks the windows 1000:1100 up to 1800:1900
    ptp = np.ptp(x[i * 100:(i + 1) * 100])
    if ptp < 0.01:
      return True
  return False

which marks a sequence of 2000 values as bad whenever one of the windows x[1000:1100], x[1100:1200], ..., x[1800:1900] has a peak-to-peak value of less than 0.01. When that is the case I want to remove that sequence of 2000 values (which can be selected from numpy with a[:,x,y]). numpy.delete seems to accept indices, but only for 2-dimensional arrays.

You will definitely have to reshape your input array: cutting arbitrary "rows" out of a 3D cube would leave a ragged structure that can no longer be addressed as a regular array.

Since we don't have your data, I'll first use a smaller example to explain how this possible solution works:

>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>> 
>>> threshold = 18
>>> a = np.arange(5*3*2).reshape(5,3,2)  # your dataset of 2000x2500x32
>>> # Taint the data:
... a[0,0,0] = 5
>>> a[a==22]=20
>>> print(a)
[[[ 5  1]
  [ 2  3]
  [ 4  5]]

 [[ 6  7]
  [ 8  9]
  [10 11]]

 [[12 13]
  [14 15]
  [16 17]]

 [[18 19]
  [20 21]
  [20 23]]

 [[24 25]
  [26 27]
  [28 29]]]
>>> a2 = a.reshape(-1, np.prod(a.shape[1:]))
>>> print(a2)  # Will prove to be much easier to work with!
[[ 5  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 20 23]
 [24 25 26 27 28 29]]

As you can see from the representation above, it is now much clearer over which windows you want to compute the peak-to-peak value. And you'll need this 2D form if you're going to remove "rows" (now they have been transformed into columns) from this data structure, something you couldn't do in 3 dimensions!
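One practical point the 2D view raises: after reshaping, you'll want to know which original (x, y) position each column of a2 corresponds to. Since reshape flattens the last two axes in C order, np.unravel_index recovers that mapping (a sketch on the same toy shapes; nothing here depends on the example values):

```python
import numpy as np

a = np.arange(5 * 3 * 2).reshape(5, 3, 2)   # stand-in for the (2000, 2500, 32) array
a2 = a.reshape(-1, np.prod(a.shape[1:]))    # shape (5, 6)

# Column j of a2 is the sequence a[:, x, y] with (x, y) = unravel_index(j, ...):
for j in range(a2.shape[1]):
    x, y = np.unravel_index(j, a.shape[1:])
    assert np.array_equal(a2[:, j], a[:, x, y])
```

So once you know which columns are bad, np.unravel_index(bad_columns, a.shape[1:]) tells you which (x, y) sequences they were in the original array.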

>>> isize = a.itemsize  # More generic, in case you have another dtype
>>> slice_size = 4  # How big each continuous slice is over which the Peak2Peak value is calculated
>>> slices = as_strided(a2,
...     shape=(a2.shape[0] + 1 - slice_size, slice_size, a2.shape[1]),
...     strides=(isize*a2.shape[1], isize*a2.shape[1], isize))
>>> print(slices)
[[[ 5  1  2  3  4  5]
  [ 6  7  8  9 10 11]
  [12 13 14 15 16 17]
  [18 19 20 21 20 23]]

 [[ 6  7  8  9 10 11]
  [12 13 14 15 16 17]
  [18 19 20 21 20 23]
  [24 25 26 27 28 29]]]

So I took, as an example, a window size of 4 elements: if the peak-to-peak value within any of these 4-element slices (per dataset, so per column) is less than a certain threshold, I want to exclude that column. That can be done like this:

>>> mask = np.all(slices.ptp(axis=1) >= threshold, axis=0) # These are the ones that are of interest
>>> print(a2[:,mask])
[[ 1  2  3  5]
 [ 7  8  9 11]
 [13 14 15 17]
 [19 20 21 23]
 [25 26 27 29]]

You can now clearly see that the tainted data has been removed. But remember: you could not simply have removed that data from the 3D array (though you could have masked it there).
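If you'd rather keep the 3D shape and mask the bad sequences instead of deleting them, you can fold the 1D column mask back into the (x, y) grid and build a masked array. A sketch on the toy data (the variable names bad2d and masked are mine):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

threshold = 18
a = np.arange(5 * 3 * 2).reshape(5, 3, 2)
a[0, 0, 0] = 5              # taint the data, as before
a[a == 22] = 20
a2 = a.reshape(-1, np.prod(a.shape[1:]))

isize = a.itemsize
slice_size = 4
slices = as_strided(a2,
    shape=(a2.shape[0] + 1 - slice_size, slice_size, a2.shape[1]),
    strides=(isize * a2.shape[1], isize * a2.shape[1], isize))
mask = np.all(np.ptp(slices, axis=1) >= threshold, axis=0)

# Fold the per-column mask back into the original (x, y) grid; bad sequences
# stay in place but are ignored by reductions on the masked array.
bad2d = ~mask.reshape(a.shape[1:])                        # shape (3, 2)
masked = np.ma.masked_array(a, mask=np.broadcast_to(bad2d, a.shape).copy())
```

Reductions such as masked.mean(axis=0) then skip the tainted sequences while the array keeps its 3D shape.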

Obviously, you'll have to set the threshold to 0.01 in your use case, and the slice_size to 100.
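As a side note: if your NumPy is 1.20 or newer, the hand-written as_strided call above can be replaced by np.lib.stride_tricks.sliding_window_view, which builds the same windows without manual stride arithmetic. A sketch, repeating the toy setup:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

threshold = 18
a = np.arange(5 * 3 * 2).reshape(5, 3, 2)
a[0, 0, 0] = 5              # taint the data, as before
a[a == 22] = 20
a2 = a.reshape(-1, np.prod(a.shape[1:]))

# Windows of 4 consecutive rows; result shape (n_windows, n_cols, slice_size),
# so here the peak-to-peak value is taken over the last axis.
windows = sliding_window_view(a2, window_shape=4, axis=0)
mask = np.all(np.ptp(windows, axis=-1) >= threshold, axis=0)
print(a2[:, mask])          # same result as the as_strided version
```

Unlike as_strided, sliding_window_view cannot produce out-of-bounds views, which makes it the safer choice when it is available.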

Beware: while the as_strided form is extremely memory-efficient, computing the peak-to-peak values of this array and storing that result does require a good amount of memory in your case: 1901 x (2500 x 32) values in the worst case, i.e. when you don't ignore the first 1000 slices. In your case, where you're only interested in the slices from 1000:1900, you would add that restriction to the code like so:

mask = np.all(slices[1000:1900,:,:].ptp(axis=1) >= threshold, axis=0)

And that would reduce the memory required to store this intermediate result to "only" 900 x (2500 x 32) values (of whatever data type you were using).
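If even that intermediate result is too large, you can trade a little speed for memory and accumulate the mask one window at a time, which never holds more than a single (n_cols,)-sized ptp result. A sketch (the function name good_columns and the variable data are mine, not from the question):

```python
import numpy as np

def good_columns(a2, slice_size, start, stop, threshold):
    """Boolean mask over the columns of the 2D view `a2`: True where every
    window of `slice_size` rows starting in range(start, stop) has a
    peak-to-peak value >= threshold. Equivalent to
    np.all(slices[start:stop].ptp(axis=1) >= threshold, axis=0),
    but holds only one (n_cols,)-sized ptp result at a time."""
    mask = np.ones(a2.shape[1], dtype=bool)
    for s in range(start, stop):
        mask &= np.ptp(a2[s:s + slice_size], axis=0) >= threshold
    return mask

# In the question's use case this would be called as (hypothetical `data`):
# mask = good_columns(data.reshape(-1, 2500 * 32), 100, 1000, 1900, 0.01)
```

The loop over 900 window positions is slower than the strided one-shot version, but its peak memory is just the running mask plus one row of ptp values.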
