
Search numpy array ((x, y, z)…) for z matching nearest x, y

I have a very large array similar to elevation data of the format:

triplets = ((x0, y0, z0), 
            (x1, y1, z1), 
            ... ,
            (xn, yn, zn))

where x, y, z are all floats in metres. You can create suitable test data matching this format with:

x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0
triplets = np.vstack((x, y, z)).T  # shape (20, 3), one (x, y, z) row per point

I want to be able to efficiently find the corresponding z-value for a given (x,y) pair. My research so far leads to more questions. Here's what I've got:

  1. Iterate through all of the triplets:

     query = (a, b)  # where a, b are the x and y coordinates we're looking for
     for i in triplets:
         if i[0] == query[0] and i[1] == query[1]:
             result = i[2]

    Drawbacks: slow; requires that (a, b) exist exactly in the data, which is fragile when comparing floats for equality.

  2. Use scipy.spatial.cKDTree to find nearest:

     points = triplets[:, 0:2]    # drops the z column
     tree = cKDTree(points)
     idx = tree.query((a, b))[1]  # query() returns (distance, index); we want the index
     query = tree.data[idx]       # the nearest (x, y) point
     result = triplets[idx, 2]

    Drawbacks: returns nearest point rather than interpolating.

  3. Using interp2d as per comment:

     from scipy.interpolate import interp2d
     f = interp2d(x, y, z)
     result = f(a, b)

    Drawbacks: doesn't work on a large dataset. I get OverflowError: Too many data points to interpolate when run on real data. (My real data is some 11 million points.)

So the question is: is there any straightforward way of doing this that I'm overlooking? Are there ways to reduce the drawbacks of the above?

If you want to interpolate the result, rather than just find the z value for the nearest neighbour, I would consider doing something like the following:

  1. Use a kd tree to partition your data points according to their (x, y) coordinates
  2. For a given (xi, yi) point to interpolate, find its k nearest neighbours
  3. Take the average of their z values, weighted according to their distance from (xi, yi)

The code might look something like this:

import numpy as np
from scipy.spatial import cKDTree

# some fake (x, y, z) data
XY = np.random.rand(10000, 2) - 0.5
Z = np.exp(-((XY ** 2).sum(1) / 0.1) ** 2)

# construct a k-d tree from the (x, y) coordinates
tree = cKDTree(XY)

# a random point to query
xy = np.random.rand(2) - 0.5

# find the k nearest neighbours (say, k=3)
distances, indices = tree.query(xy, k=3)

# the z-values for the k nearest neighbours of xy
z_vals = Z[indices]

# take the average of these z-values, weighted by 1 / distance from xy
dw_avg = np.average(z_vals, weights=(1. / distances))

It's worth playing around a bit with the value of k, the number of nearest neighbours to average over. This is essentially a crude form of kernel regression, where the value of k controls the degree of 'smoothness' you're imposing on the underlying distribution of z-values: a larger k results in more smoothness.

Similarly, you might want to play around with how you weight the contributions of points according to their distance from (xi, yi), depending on how you think similarity in z decreases with increasing x, y distance. For example, you might want to weight by (1 / distances ** 2) rather than (1 / distances).

In terms of performance, constructing and searching k-d trees is very efficient. Bear in mind that you only need to construct the tree once for your dataset, and if necessary you can query multiple points at a time by passing (N, 2) arrays to tree.query().
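The batched-query idea can be sketched as follows; the inverse-square weighting is one of the alternatives mentioned above, and the small epsilon is an added guard for the case where a query point coincides exactly with a data point (distance zero):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# some fake (x, y, z) data, as in the example above
XY = rng.random((10000, 2)) - 0.5
Z = np.exp(-((XY ** 2).sum(1) / 0.1) ** 2)

# construct the tree once
tree = cKDTree(XY)

# query several points at a time by passing an (N, 2) array
queries = rng.random((5, 2)) - 0.5
distances, indices = tree.query(queries, k=3)  # both have shape (N, 3)

# inverse-square distance weights; the epsilon avoids division by zero
# when a query lands exactly on a data point
weights = 1.0 / (distances ** 2 + 1e-12)
z_interp = (weights * Z[indices]).sum(axis=1) / weights.sum(axis=1)
```

Each row of `distances`/`indices` describes one query point's k nearest neighbours, so the weighted average reduces along `axis=1` to give one interpolated z per query.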

Tools for approximate nearest neighbour searches, such as FLANN, might potentially be quicker, but these are usually more helpful when the dimensionality of your data is very high.

I don't quite follow your cKDTree code: once you have idx, there's no need to loop over the data again — you can get the result directly with result = triplets[idx, 2].

from scipy.spatial import cKDTree

x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0
triplets = np.vstack((x, y, z)).T  # shape (20, 3), one (x, y, z) row per point

a = 30.1
b = 40.5

points = triplets[:,0:2] # drops the z column
tree = cKDTree(points)
idx = tree.query((a, b))[1] # this returns a tuple, we want the index
result = triplets[idx, 2]

You can create a sparse matrix and use simple indexing. (This works here because the x and y values, although stored as floats, are whole numbers that can serve as row and column indices.)

In [1]: import numpy as np
In [2]: x = np.arange(20, 40, dtype=np.float64)
In [3]: y = np.arange(30, 50, dtype=np.float64)
In [4]: z = np.random.random(20) * 25.0
In [9]: from scipy.sparse import coo_matrix
In [12]: m = coo_matrix((z, (x, y))).tolil()
In [17]: m[25,35]
Out[17]: 17.410532044604292
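Since coo_matrix expects non-negative integer row/column indices, it is safer to cast the coordinates explicitly; a self-contained sketch of the same lookup (note that m[25, 35] retrieves the z value stored for the point with x=25, y=35):

```python
import numpy as np
from scipy.sparse import coo_matrix

x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0

# cast the whole-valued float coordinates to integer indices
rows = x.astype(np.intp)
cols = y.astype(np.intp)
m = coo_matrix((z, (rows, cols))).tolil()

# direct lookup by coordinate
result = m[25, 35]
```

The obvious limitation is that this only works when the coordinates are (castable to) non-negative integers; truly float-valued x, y would need one of the tree-based approaches above.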
