Interpolate unstructured X,Y,Z data on best grid based on nearest neighbour distance for each points

Question

This question was edited after answers for show final solution I used

I have unstructured 2D datasets coming from different sources, like by example: Theses datasets are 3 numpy.ndarray (X, Y coordinates and Z value).

My final aim is to interpolate theses datas on a grid for conversion to image/matrix. So, I need to find the "best grid" for interpolate theses datas. And, for this I need to find the best X and Y step between pixels of that grid.

Determinate step based on Euclidean distance between points:

Use the mean of Euclidean distances between each point and its nearest neighbour.

Use KDTree / cKDTree from scipy.spacial for build tree of the X,Y datas.
Use the query method with k=2 for get the distances (If k=1 , distances are only zero because query for each point found itself).



    # Generate KD Tree
    xy = np.c_[x, y]  # X,Y data converted for use with KDTree
    tree = scipy.spacial.cKDTree(xy)  # Create KDtree for X,Y coordinates.

    # Calculate step
    distances, points = tree.query(xy, k=2)  # Query distances for X,Y points
    distances = distances[:, 1:]  # Remove k=1 zero distances
    step = numpy.mean(distances)  # Result

Performance tweaking:

Use of scipy.spatial.cKDTree and not scipy.spatial.KDTree because it is really faster.
Use balanced_tree=False with scipy.spatial.cKDTree : Big speed up in my case, but may not be true for all data.
Use n_jobs=-1 with cKDTree.query for use multithreading.
Use p=1 with cKDTree.query for use Manhattan distance in place of Euclidian distance ( p=2 ): Faster but may be less accurate.
Query the distance for only a random subsample of points: Big speed up with large datasets, but may be less accurate and less repeatable.

Interpolate points on grid:

Interpolate dataset points on grid using the calculated step.



    # Generate grid
    def interval(axe):
        '''Return numpy.linspace Interval for specified axe'''
        cent = axe.min() + axe.ptp() / 2  # Interval center
        nbs = np.ceil(axe.ptp() / step)  # Number of step in interval
        hwid = nbs * step / 2  # Half interval width 
        return np.linspace(cent - hwid, cent + hwid, nbs)  # linspace

    xg, yg = np.meshgrid(interval(x), interval(y))  # Generate grid

    # Interpolate X,Y,Z datas on grid
    zg = scipy.interpolate.griddata((x, y), z, (xg, yg))

Set NaN if pixel too far from initials points:

Set NaN to pixels from grid that are too far (Distance > step) from points from initial X,Y,Z data. The previous generated KDTree is used.



    # Calculate pixel to X,Y,Z data distances
    dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
    dist = dist.reshape(xg.shape)

    # Set NaN value for too far pixels
    zg[dist > step] = np.nan

Answer 1

The problem you want to solve is called the "all-nearest-neighbors problem". See this article for example: http://link.springer.com/article/10.1007/BF02187718

I believe solutions to this are O(N log N), so on the same order as KDTree.query, but in practice much, much faster than a bunch of separate queries. I'm sorry, I don't know of a python implementation of this.

Answer 2

I suggest you to go with KDTree.query .

You are searching of a carachteristic distance to scale your binning: I suggest you to take only a random subset of your points, and to use the Manhattan distance , becasue KDTree.query is very slow (and yet it is an*log(n) complexity).

Here is my code:

# CreateTree
tree=scipy.spatial.KDTree(numpy.array(points)) # better give it a copy?
# Create random subsample of points
n_repr=1000
shuffled_points=numpy.array(points)
numpy.random.shuffle(shuffled_points)
shuffled_points=shuffled_points[:n_repr]
# Query the tree
(dists,points)=tree.query(shuffled_points,k=2,p=1)
# Get _extimate_ of average distance:
avg_dists=numpy.average(dists)
print('average distance Manhattan with nearest neighbour is:',avg_dists)

I suggest you to use the Manhattan distance ( https://en.wikipedia.org/wiki/Taxicab_geometry ) because it is was faster to compute than the euclidean distance. And since you need only an estimator of the average distance it should be sufficient.

Interpolate unstructured X,Y,Z data on best grid based on nearest neighbour distance for each points

Question

2 answers

solution1
2 2016-01-16 03:54:19

solution2
0 ACCPTED 2016-01-15 15:10:12

Interpolate unstructured X,Y,Z data on best grid based on nearest neighbour distance for each points

Question

2 answers

solution1 2 2016-01-16 03:54:19

solution2 0 ACCPTED 2016-01-15 15:10:12

solution1
2 2016-01-16 03:54:19

solution2
0 ACCPTED 2016-01-15 15:10:12