Efficiently select elements from an (x,y) field with a 2D mask in Python

Question

I have a large field of 2D position data, given as two arrays x and y, where len(x) == len(y). I would like to return the array of indices idx_masked at which (x[idx_masked], y[idx_masked]) is masked by an N x N int array called mask. That is, mask[x[idx_masked], y[idx_masked]] == 1. The mask array consists of 0s and 1s only.

I have come up with the following solution, but it (specifically, the last line below) is very slow, given that N x N = 5000 x 5000 and the operation is repeated thousands of times:

import numpy as np
import matplotlib.pyplot as plt

# example mask of one corner of a square
N = 100
mask = np.zeros((N, N))
mask[0:10, 0:10] = 1

# example x and y position arrays in arbitrary units
x = np.random.uniform(0, 1, 1000)
y = np.random.uniform(0, 1, 1000)

x_bins = np.linspace(np.min(x), np.max(x), N)
y_bins = np.linspace(np.min(y), np.max(y), N)

x_bin_idx = np.digitize(x, x_bins)
y_bin_idx = np.digitize(y, y_bins)

idx_masked = np.ravel(np.where(mask[y_bin_idx - 1, x_bin_idx - 1] == 1))

plt.imshow(mask[::-1, :])

plt.scatter(x, y, color='red')
plt.scatter(x[idx_masked], y[idx_masked], color='blue')

Is there a more efficient way of doing this?

Answer 1

Given that mask overlays your field with identically sized bins, you do not need to define the bins explicitly. *_bin_idx can be determined at each location from a simple floor division, since you know that each bin is 1 / N in size. If you know the expected range in advance, I would recommend using it for the total width (1 - 0, matching what you passed into np.random.uniform) instead of x.max() - x.min().

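x0 = 0   # lower edge of the field, or x.min()
x1 = 1   # upper edge of the field, or x.max()
x_bin = (x1 - x0) / N   # width of one bin
x_bin_idx = ((x - x0) // x_bin).astype(int)
# a point exactly at x1 would land in bin N, so clip it into the last bin
x_bin_idx = np.clip(x_bin_idx, 0, N - 1)

# ditto for y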

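This will be faster and simpler than digitizing, and avoids the extra bin at the beginning (np.digitize returns 0 for values below the first edge, which is why the question subtracts 1 from every index).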
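The floor-division binning described above can be sketched as a small helper. The function name bin_indices and its parameters are hypothetical, introduced only for illustration, and the sketch assumes the field range [lo, hi] is known in advance.

```python
import numpy as np

def bin_indices(values, lo, hi, n_bins):
    """Map values in [lo, hi] to integer bin indices in [0, n_bins - 1]."""
    width = (hi - lo) / n_bins                 # every bin has the same width
    idx = ((values - lo) // width).astype(int)
    # A value exactly equal to hi would fall into bin n_bins, so clip it back.
    return np.clip(idx, 0, n_bins - 1)

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(0, 1, 1000)
x_bin_idx = bin_indices(x, 0.0, 1.0, N)        # same shape as x, values in 0..N-1
```

Because of floating-point rounding, a value sitting almost exactly on a bin edge may end up in the neighbouring bin; for randomly distributed positions this is rarely a concern.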

For most purposes, you do not need np.where. 90% of the questions asking about it (including this one) should not be using where. If you want a fast way to access the necessary elements of x and y, just use a boolean mask. The mask is simply

selection = mask[x_bin_idx, y_bin_idx].astype(bool)

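If mask is already boolean (which it should be anyway), the expression mask[x_bin_idx, y_bin_idx] is sufficient. It results in an array of the same size as x_bin_idx and y_bin_idx (which are the same size as x and y) containing the mask value for each of your points. Note that the code in the question indexes the mask as [y, x]; if the rows of your mask correspond to y, swap the order to mask[y_bin_idx, x_bin_idx]. You can use the mask as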

x[selection]   # Elements of x in mask
y[selection]   # Elements of y in mask

If you absolutely need the integer indices, where is still not your best option.

indices = np.flatnonzero(selection)

OR

indices = selection.nonzero()[0]
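As a quick check, the following sketch puts the pieces together on the question's example (a mask covering one corner) and confirms that the boolean selection, np.flatnonzero, nonzero()[0] and the original np.where approach all pick the same points. It assumes a field range of [0, 1] on both axes and a mask whose first axis corresponds to x.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
mask = np.zeros((N, N), dtype=bool)   # boolean from the start
mask[0:10, 0:10] = True               # one corner of the square

x = rng.uniform(0, 1, 1000)
y = rng.uniform(0, 1, 1000)

# Floor-division binning, clipped so a value of exactly 1.0 stays in range.
x_bin_idx = np.clip((x // (1 / N)).astype(int), 0, N - 1)
y_bin_idx = np.clip((y // (1 / N)).astype(int), 0, N - 1)

selection = mask[x_bin_idx, y_bin_idx]         # one boolean per point
indices = np.flatnonzero(selection)            # integer indices, if needed
indices_where = np.ravel(np.where(selection))  # the question's approach

print(np.array_equal(indices, indices_where))
print(np.array_equal(x[selection], x[indices]))
```

Both comparisons should report that the results agree, so the choice between the variants comes down to speed and readability.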

If your goal is simply to extract values from x and y, I would recommend stacking them together into a single array:

coords = np.stack((x, y), axis=1)

This way, instead of having to apply indices twice, you can extract the values with just

coords[selection, :]

OR

coords[indices, :]
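A minimal sketch of the stacked layout, using a made-up selection (x < 0.5) as a stand-in for the real mask lookup:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 8)
y = rng.uniform(0, 1, 8)
selection = x < 0.5   # hypothetical stand-in for mask[x_bin_idx, y_bin_idx]

coords = np.stack((x, y), axis=1)   # shape (8, 2): one (x, y) row per point
picked = coords[selection]          # shape (k, 2): only the selected rows

print(coords.shape, picked.shape)
```

Column 0 of picked holds the selected x values and column 1 the selected y values, so a single indexing operation replaces the two separate ones.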

Depending on the relative densities of mask and x and y, either the boolean masking or linear indexing may be faster. You will have to time some relevant cases to get a better intuition.
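One way to run such a timing is sketched below with the standard timeit module. The helper compare, the point count and the selected fractions are arbitrary assumptions chosen for illustration; substitute sizes that match your own data.

```python
import timeit
import numpy as np

rng = np.random.default_rng(2)
n_points = 100_000
coords = rng.uniform(0, 1, (n_points, 2))

def compare(fraction_selected, repeats=20):
    """Time boolean masking against integer indexing for one selection density."""
    selection = rng.uniform(0, 1, n_points) < fraction_selected
    indices = np.flatnonzero(selection)
    t_bool = timeit.timeit(lambda: coords[selection], number=repeats)
    t_int = timeit.timeit(lambda: coords[indices], number=repeats)
    return t_bool, t_int

for fraction in (0.01, 0.5):
    t_bool, t_int = compare(fraction)
    print(f"fraction={fraction}: boolean {t_bool:.4f} s, integer {t_int:.4f} s")
```

The absolute numbers depend on the machine and the NumPy version, so treat the output as a relative comparison only.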

solution1
2 ACCEPTED 2020-03-30 14:58:29