I've got a NumPy array containing labels. I'd like to compute a number for each label based on its size and bounding box. How can I write this more efficiently so that it's realistic to use on large arrays (~15000 labels)?
import numpy as np

A = np.array([[1, 1, 0, 3, 3],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 2, 2],
              [1, 0, 2, 2, 2]])
B = np.zeros(4)

for label in range(1, 4):
    # get the bounding box of the label
    label_points = np.argwhere(A == label)
    (y0, x0), (y1, x1) = label_points.min(0), label_points.max(0) + 1
    # assume I've already computed the size of each label in an array size_A
    B[label] = myfunc(y0, x0, y1, x1, size_A[label])
I wasn't really able to implement this efficiently using NumPy's vectorised functions, so maybe a clever pure-Python implementation will be faster.
def first_row(a, labels):
    d = {}
    d_setdefault = d.setdefault   # local aliases to avoid repeated lookups
    len_ = len
    num_labels = len_(labels)
    for i, row in enumerate(a):
        for label in row:
            d_setdefault(label, i)
        if len_(d) == num_labels:
            break                 # stop once every label has been seen
    return d
This function returns a dictionary mapping each label to the index of the first row it appears in. Applying the function to A, A.T, A[::-1] and A.T[::-1] also gives you the first column as well as the last row and column. If you would rather have a list instead of a dictionary, you can turn the dictionary into one using list(map(d.get, labels)). Alternatively, you can use a NumPy array instead of a dictionary right from the start, but you will lose the ability to leave the loop early as soon as all labels have been found.
I'd be interested whether (and how much) this actually speeds up your code, but I'm confident that it is faster than your original solution.
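To make the four-orientation trick concrete, here is a sketch that combines the four calls into bounding boxes. Note one assumption: `labels` must list every value that occurs in the array (background included), otherwise the early-exit length check can trigger before all the labels you care about have been seen.

```python
import numpy as np

def first_row(a, labels):
    # same function as above: first row index in which each label appears
    d = {}
    num_labels = len(labels)
    for i, row in enumerate(a):
        for label in row:
            d.setdefault(label, i)
        if len(d) == num_labels:
            break
    return d

A = np.array([[1, 1, 0, 3, 3],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 2, 2],
              [1, 0, 2, 2, 2]])
labels = [0, 1, 2, 3]   # every value in A, background included
h, w = A.shape

first_r = first_row(A, labels)         # first row per label    -> y0
first_c = first_row(A.T, labels)       # first column per label -> x0
last_r = first_row(A[::-1], labels)    # rows from the bottom   -> y1 = h - value
last_c = first_row(A.T[::-1], labels)  # columns from the right -> x1 = w - value

boxes = {lab: (first_r[lab], first_c[lab], h - last_r[lab], w - last_c[lab])
         for lab in labels if lab != 0}
print(boxes)   # {label: (y0, x0, y1, x1)}
```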
Algorithm: for a large array such as (7000, 9000), it can finish the calculation in about 30 s. Here is the code:
import numpy as np
from itertools import zip_longest   # izip_longest on Python 2

A = np.array([[1, 1, 0, 3, 3],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 2, 2],
              [1, 0, 2, 2, 2]])

def label_range(A):
    h, w = A.shape
    tmp = A.reshape(-1)
    # sort the flattened array so that equal labels form contiguous runs
    index = np.argsort(tmp)
    sorted_A = tmp[index]
    # positions where the label changes; the first run (label 0) is skipped
    pos = np.where(np.diff(sorted_A))[0] + 1
    for p1, p2 in zip_longest(pos, pos[1:]):
        label_index = index[p1:p2]
        y = label_index // w   # row coordinates of this label's pixels
        x = label_index % w    # column coordinates
        x0 = np.min(x)
        x1 = np.max(x) + 1
        y0 = np.min(y)
        y1 = np.max(y) + 1
        label = tmp[label_index[0]]
        yield label, x0, y0, x1, y1
for label, x0, y0, x1, y1 in label_range(A):
    print("%d:(%d,%d)-(%d,%d)" % (label, x0, y0, x1, y1))
#B = np.random.randint(0, 100, (7000, 9000))
#list(label_range(B))
Another method: use bincount() to get the label counts in every row and column, and save that information in the rows and cols arrays. For each label you then only need to search its range in rows and cols. This is faster than sorting; on my PC it finishes the calculation in a few seconds.
def label_range2(A):
    maxlabel = np.max(A) + 1
    h, w = A.shape
    # rows[i, label] is True if `label` occurs anywhere in row i
    rows = np.zeros((h, maxlabel), bool)
    for row in range(h):
        rows[row, :] = np.bincount(A[row, :], minlength=maxlabel) > 0
    # cols[j, label] is True if `label` occurs anywhere in column j
    cols = np.zeros((w, maxlabel), bool)
    for col in range(w):
        cols[col, :] = np.bincount(A[:, col], minlength=maxlabel) > 0
    for label in range(1, maxlabel):
        y = np.where(rows[:, label])[0]
        x = np.where(cols[:, label])[0]
        x0 = np.min(x)
        x1 = np.max(x) + 1
        y0 = np.min(y)
        y1 = np.max(y) + 1
        yield label, x0, y0, x1, y1
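As a quick sanity check on the small example, label_range2 (repeated below so the snippet runs on its own) yields the same boxes that label_range prints above:

```python
import numpy as np

# label_range2 repeated here so the snippet is self-contained
def label_range2(A):
    maxlabel = np.max(A) + 1
    h, w = A.shape
    rows = np.zeros((h, maxlabel), bool)
    for row in range(h):
        rows[row, :] = np.bincount(A[row, :], minlength=maxlabel) > 0
    cols = np.zeros((w, maxlabel), bool)
    for col in range(w):
        cols[col, :] = np.bincount(A[:, col], minlength=maxlabel) > 0
    for label in range(1, maxlabel):
        y = np.where(rows[:, label])[0]
        x = np.where(cols[:, label])[0]
        yield label, x.min(), y.min(), x.max() + 1, y.max() + 1

A = np.array([[1, 1, 0, 3, 3],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 2, 2],
              [1, 0, 2, 2, 2]])

result = [tuple(int(v) for v in t) for t in label_range2(A)]
print(result)   # [(label, x0, y0, x1, y1), ...]
```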
The performance bottleneck indeed seems to be the call to argwhere. It can be avoided by changing the loop as follows (this only computes y0 and y1, but it is easy to generalize to x0 and x1):
for label in range(1, 4):
    comp = (A == label)
    yminind = comp.argmax(0)   # first row containing the label, per column
    ymin = comp.max(0)         # which columns contain the label at all
    ymaxind = comp.shape[0] - comp[::-1].argmax(0)
    y0 = yminind[ymin].min()
    y1 = ymaxind[ymin].max()
I'm not sure about the reason for the performance difference, but one reason might be that operations like ==, argmax, and max can preallocate their output array directly from the shape of the input array, which is not possible for argwhere.
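Generalizing the same trick to x0 and x1 gives a complete bounding box per label. A sketch (the `bbox` helper name is mine, not from the original code):

```python
import numpy as np

def bbox(A, label):
    # argmax along an axis returns the index of the first True;
    # max along the same axis tells us whether any True exists there
    comp = (A == label)
    xmask = comp.max(0)   # columns that contain the label
    ymask = comp.max(1)   # rows that contain the label
    y0 = comp.argmax(0)[xmask].min()
    y1 = (comp.shape[0] - comp[::-1].argmax(0))[xmask].max()
    x0 = comp.argmax(1)[ymask].min()
    x1 = (comp.shape[1] - comp[:, ::-1].argmax(1))[ymask].max()
    return int(y0), int(x0), int(y1), int(x1)

A = np.array([[1, 1, 0, 3, 3],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 2, 2],
              [1, 0, 2, 2, 2]])

for label in range(1, 4):
    print(label, bbox(A, label))   # (y0, x0, y1, x1)
```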
Using PyPy you can just run the loop directly and not worry about vectorization. It should be fast.