
Efficient way to subset and combine arrays of different lengths

Given a 3-dimensional boolean array:

np.random.seed(13)
bool_data = np.random.randint(2, size=(2,3,6))

>> bool_data 
array([[[0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1]],

       [[1, 0, 1, 1, 0, 0],
        [0, 1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0, 0]]])

I wish to count the number of consecutive 1's bounded by two 0's in each row (along the last axis) and return a single array with all the tallies. Runs of 1's that touch the edge of a row, such as the trailing 1 in [0, 0, 0, 0, 0, 1], are not counted because they are not bounded by 0's on both sides. For bool_data, this would give array([1, 1, 2, 4]).

Due to the 3D structure of bool_data and the variable number of tallies per row, I had to clumsily collect the tallies into nested lists, flatten them with itertools.chain, and then convert the flattened list back into an array:

# count consecutive 1's bounded by two 0's:
# the difference between consecutive 0 positions, minus 1, is the run length in between
def count_consect_ones(row):
    return np.diff(np.where(row == 0)[0]) - 1
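
For a single row this gives the gap between every consecutive pair of 0's (zero for adjacent 0's), which is why the res != 0 filter below is needed; for example:

>> count_consect_ones(bool_data[0, 1])
array([1, 0, 1])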

# run tallies across all rows in bool_data
consect_ones = []
for i in range(len(bool_data)):
    for j in range(len(bool_data[i])):
        res = count_consect_ones(bool_data[i, j])
        consect_ones.append(list(res[res!=0]))

>> consect_ones
[[], [1, 1], [], [2], [4], []]

# combine the nested lists into a single flat array
from itertools import chain
consect_ones_output = np.array(list(chain.from_iterable(consect_ones)))

>> consect_ones_output
array([1, 1, 2, 4])

Is there a more efficient or clever way of doing this?

consect_ones.append(list(res[res!=0]))

If you use .extend instead, the contents of the sequence are appended directly, which saves the step of combining the nested lists afterwards:

consect_ones.extend(res[res!=0])

Furthermore, you could skip the indexing and iterate over the sub-arrays directly:

consect_ones = []
for i in bool_data:
    for j in i:
        res = count_consect_ones(j)
        consect_ones.extend(res[res!=0])
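
Since count_consect_ones only ever looks at a single row, you could go one step further and collapse the two loops by flattening the leading axes, then stitching the per-row tallies together with a single np.concatenate. A minimal sketch along the same lines, reusing the helper from the question:

# treat the 3D array as a flat stack of rows
rows = bool_data.reshape(-1, bool_data.shape[-1])
per_row = [count_consect_ones(row) for row in rows]
# keep only the runs actually bounded by 0's and combine them
consect_ones_output = np.concatenate([r[r != 0] for r in per_row])

>> consect_ones_output
array([1, 1, 2, 4])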

We could use a trick here: pad each row with zeros, look for ramp-up and ramp-down indices on a flattened version, and finally filter out the indices corresponding to the border islands to give ourselves a vectorized solution, like so -

# Input 3D array : a
b = np.pad(a, ((0,0),(0,0),(1,1)), 'constant', constant_values=(0,0))

# Get ramp-up and ramp-down indices/ start-end indices of 1s islands
s0 = np.flatnonzero(b[...,1:]>b[...,:-1])
s1 = np.flatnonzero(b[...,1:]<b[...,:-1])

# Filter only valid ones that are not at borders
n = b.shape[2]
valid_mask = (s0%(n-1)!=0) & (s1%(n-1)!=a.shape[2])
out = (s1-s0)[valid_mask]

Explanation -

The idea with padding zeros at either end of each row as "sentinels" is that when we compare one-off sliced versions of the padded array, we can detect the ramp-up and ramp-down places with b[...,1:]>b[...,:-1] and b[...,1:]<b[...,:-1] respectively. Thus, we get s0 and s1 as the start and end indices for each of the islands of 1s. Now, we don't want the border islands, so we trace their column indices back to the original un-padded input array, hence the s0%(n-1) and s1%(n-1) bits. We need to remove every island of 1s that starts at the left border or ends at the right border, i.e. every case where s0%(n-1) is 0 or s1%(n-1) is a.shape[2]. That gives us the valid ones. The island lengths are obtained with s1-s0, so masking that with valid_mask gives the desired output.
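
To make the border filtering concrete, here is a hand trace of those intermediates for the sample a shown below:

>> n - 1                           # row length of the sliced comparison arrays
7
>> s0
array([ 8, 11, 19, 21, 23, 29, 35])
>> s1
array([ 9, 12, 20, 22, 25, 33, 38])
>> s0 % (n-1)                      # start columns in the un-padded array
array([1, 4, 5, 0, 2, 1, 0])
>> s1 % (n-1)                      # one past the last 1 of each island
array([2, 5, 6, 1, 4, 5, 3])
>> s1 - s0                         # lengths of all islands of 1s
array([1, 1, 1, 1, 2, 4, 3])
>> valid_mask                      # drops the three islands touching a border
array([ True,  True, False, False,  True,  True, False])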

Sample input, output -

In [151]: a
Out[151]: 
array([[[0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1]],

       [[1, 0, 1, 1, 0, 0],
        [0, 1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0, 0]]])

In [152]: out
Out[152]: array([1, 1, 2, 4])
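
Putting it together as a reusable helper (a sketch of the same steps; count_islands is just a name chosen here):

def count_islands(a):
    # pad each row with a zero on both sides so border islands get a detectable edge
    b = np.pad(a, ((0,0),(0,0),(1,1)), 'constant', constant_values=(0,0))
    # start/end indices of the islands of 1s on the flattened comparisons
    s0 = np.flatnonzero(b[...,1:] > b[...,:-1])
    s1 = np.flatnonzero(b[...,1:] < b[...,:-1])
    n = b.shape[2]
    # keep only islands that do not touch either border of the original rows
    valid_mask = (s0 % (n-1) != 0) & (s1 % (n-1) != a.shape[2])
    return (s1 - s0)[valid_mask]

>> count_islands(bool_data)
array([1, 1, 2, 4])
>> np.array_equal(count_islands(bool_data), consect_ones_output)
True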
