
Performance issues with Nearest Neighbor Search and VPTrees

I have read this Q/A: knn with big sparse matrices in python, and I have a similar problem. I have a sparse array of radar data of size 125930, and the longitude and latitude arrays have the same shape. Only 5% of the data is not NULL; the rest are all NULLs.

The data lies on a sphere, so I use a VPTree with great-circle distance to compute distances. The grid spacing is irregular, and I would like to interpolate this data onto a regular grid on the sphere with 0.05-degree spacing in the lat and lon directions. In the coarse grid, the spacing between two latitudes is 0.01 and the spacing between two longitudes is 0.09. So I create my mesh grid in the following way, based on the maximum latitude and longitude of the irregular grid, which gives 12960000 grid points in total.

import numpy as np
import vptree

# regular output grid at 0.05-degree spacing
latGrid = np.arange(minLat, maxLat, 0.05)
lonGrid = np.arange(minLo, maxLo, 0.05)

gridLon, gridLat = np.meshgrid(lonGrid, latGrid)
grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]

# keep only the non-NULL observations and their coordinates
radar_data = radar_element[np.nonzero(radar_element)]
lat_surface = lat[np.nonzero(radar_element)]
lon_surface = lon[np.nonzero(radar_element)]

points = np.c_[lon_surface, lat_surface]
if points.size > 0:
    tree = vptree.VPTree(points, greatCircleDistance)
    for grid_point in grid_points:
        # all observations within range of this grid point
        indices = tree.get_all_in_range(grid_point, 4.3)
        args.append(indices)

The problem is the query

get_all_in_range

It currently takes 12 minutes to run for every pass of the above data, and I have a total of 175 passes, so the total time is 35 hours, which is unacceptable. Is there any way to reduce the number of grid points (based on some similarity) that are sent to the query, since the bulk of the indices returned is null? I have also used Scikit-learn's BallTree, and its performance is even worse. I am not sure whether FLANN is an appropriate fit for my problem.

I would just convert to 3D coordinates and use Euclidean distance.
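
A minimal sketch of that idea (the helpers to_xyz and chord_radius are mine, not from the answer): on a unit sphere the chord distance for a central angle θ is 2·sin(θ/2), which is monotonic in great-circle distance, so a Euclidean range query with the converted radius returns exactly the points within the great-circle radius. With 3D points you can use a standard KD-tree such as SciPy's cKDTree:

import numpy as np
from scipy.spatial import cKDTree

def to_xyz(lon_deg, lat_deg):
    """Convert lon/lat in degrees to 3D coordinates on the unit sphere."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.c_[np.cos(lat) * np.cos(lon),
                 np.cos(lat) * np.sin(lon),
                 np.sin(lat)]

def chord_radius(gc_radius_rad):
    """Euclidean (chord) radius equivalent to a great-circle radius in radians."""
    return 2.0 * np.sin(gc_radius_rad / 2.0)

# assumption: the search radius expressed as a central angle in radians
gc_radius_rad = 4.3 / 6371.0  # e.g. if the question's 4.3 is kilometers

tree = cKDTree(to_xyz(lon_surface, lat_surface))
neighbors = tree.query_ball_point(to_xyz(grid_points[:, 0], grid_points[:, 1]),
                                  r=chord_radius(gc_radius_rad))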

You can use something like Annoy (disclosure: I'm the author)

Example from something I built: https://github.com/erikbern/ping/blob/master/plot.py
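
A hedged sketch of how Annoy might be applied to this problem, reusing the hypothetical to_xyz and chord_radius helpers from the previous answer's sketch. Annoy answers k-nearest-neighbor queries rather than radius queries, so you over-fetch a fixed k and filter by distance:

from annoy import AnnoyIndex

index = AnnoyIndex(3, 'euclidean')  # 3D unit-sphere coordinates
for i, p in enumerate(to_xyz(lon_surface, lat_surface)):
    index.add_item(i, p)
index.build(10)  # 10 trees; more trees give better recall at query time

r = chord_radius(gc_radius_rad)
for gp in to_xyz(grid_points[:, 0], grid_points[:, 1]):
    # over-fetch 50 neighbors (a guess), then keep those within the radius
    ids, dists = index.get_nns_by_vector(gp, 50, include_distances=True)
    within = [i for i, d in zip(ids, dists) if d <= r]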

I would first put your radar observations into a spatial index as lat/lon. For the sake of Python, let's use an R-Tree. I would follow this concept:

http://toblerity.org/rtree/tutorial.html#using-rtree-as-a-cheapo-spatial-database

Load your radar observations:

from rtree import index

idx = index.Index()
for id, (y, x, m) in enumerate(observations):
    # degenerate box (x, y, x, y) indexes a single point
    idx.insert(id, (x, y, x, y), obj=(y, x, m))

Then, for your desired great-circle distance, I would calculate a "safe" Euclidean distance to filter out candidate points.
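
One possible way to derive that padding, as a sketch: one degree of latitude is roughly 111 km everywhere, while a degree of longitude shrinks by cos(latitude), so pad by the worst case in your domain. The helper name, the 111 km figure, and the kilometer units are my assumptions, not part of the answer:

import math

def safe_degree_padding(radius_km, max_abs_lat_deg):
    """Degrees of lat/lon guaranteed to cover a great-circle radius in km."""
    lat_pad = radius_km / 111.0
    lon_pad = radius_km / (111.0 * math.cos(math.radians(max_abs_lat_deg)))
    return max(lat_pad, lon_pad)

# radius_km and max_abs_lat_deg come from your problem setup
safe_distance = safe_degree_padding(radius_km, max_abs_lat_deg)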

You can query the R-Tree for candidate points near the (x, y) of your output grid points:

candidates = [item.object for item in
              idx.intersection((x - safe_distance, y - safe_distance,
                                x + safe_distance, y + safe_distance),
                               objects=True)]

This will give you a list of candidate points as [(y, x, m), ...]

Now filter the candidates using a great-circle calculation. You can then do the interpolation with the remaining point objects.
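
A sketch of that filter with a standard haversine implementation (the helper and the kilometer units are my assumptions; the answer does not prescribe a formula):

import math

def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine great-circle distance in kilometers."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# candidates hold (y, x, m) tuples as stored above
nearby = [(oy, ox, m) for (oy, ox, m) in candidates
          if great_circle_km(x, y, ox, oy) <= radius_km]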

Here's another strategy that approaches the problem from the opposite direction. I think this is a better approach, for several reasons:

  • The radar observation dataset is sparse, so running calculations on every output point seems wasteful, even with the help of a spatial index.
  • The output grid has regular spacing, so it can be calculated easily.
  • It is therefore less work to look at every observation point, calculate which output points are nearby, and use that information to build a list of output points together with the observation points each is close to.

The observation data is in the form (X, Y, M) (longitude, latitude, measurement).

The output is a grid with regular spacing, like every 0.1 degrees.

First, create a dictionary for your output grid points that are near to observations:

output = {}

Then take an observation point and find the output grid points nearby that are within the great-circle distance. Start checking at a nearby output point and iterate your way outwards by row/column until you have found all the possible output points within your GCD.

This will give you a list of grid points that are within the GCD of X and Y. Something like:

get_points(X,Y) ----> [[x1, y1], [x2, y2], ...]
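
One way get_points could be implemented, as a sketch: rather than literally spiraling outwards, scan the bounding box of grid indices implied by the safe padding and keep the points that pass the great-circle test. The grid origin (lon0, lat0), the spacing argument, and the safe_degree_padding and great_circle_km helpers from the earlier sketches are all assumptions:

import math

def get_points(X, Y, radius_km, lon0, lat0, spacing=0.1):
    """Output grid points within radius_km of observation (X, Y)."""
    pad = safe_degree_padding(radius_km, abs(Y))
    i_min = int(math.floor((X - pad - lon0) / spacing))
    i_max = int(math.ceil((X + pad - lon0) / spacing))
    j_min = int(math.floor((Y - pad - lat0) / spacing))
    j_max = int(math.ceil((Y + pad - lat0) / spacing))
    points = []
    for j in range(j_min, j_max + 1):
        for i in range(i_min, i_max + 1):
            gx, gy = lon0 + i * spacing, lat0 + j * spacing
            if great_circle_km(X, Y, gx, gy) <= radius_km:
                points.append([gx, gy])
    return points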

Now we'll flip this around. We want to store each output point and a list of the observation points that are near it. To store the point in the output dictionary we need some kind of unique key. A geohash (which interleaves latitude and longitude and generates a unique string) is perfect for this.

For each output point (x_n, y_n), compute the geohash, add an entry to the output dictionary with (x_n, y_n), and start (or append to) its list of observations:

import Geohash

key = Geohash.encode(y, x)
if key not in output:
    output[key] = {'coords': [x, y], 'observations': [[X, Y, M]]}
else:
    output[key]['observations'].append([X, Y, M])

We store the original x, y instead of reversing the geohash and losing accuracy.

When you have run through all of your observations, you will have a dictionary of all the output points that require calculation, with each point carrying a list of the observations that are within the GCD requirement.

You can then loop through the points, calculate the output array index and the interpolated value, and write that to your output array:

def get_indices(x, y):
    ''' converts an x,y to row,column in the output array '''
    ....
    return row, column

def get_weighted_value(point):
    ''' takes a point entry and performs inverse distance weighting
        using the point's coordinates and list of observations '''
    ....
    return value

for point in output.values():
    row, column = get_indices(*point['coords'])
    idw = get_weighted_value(point)
    outarray[row, column] = idw
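
As a sketch of what get_weighted_value could look like, here is plain inverse distance weighting, reusing the hypothetical great_circle_km helper from the earlier sketch; the power of 2 and the epsilon guard are my choices, not the answer's:

def get_weighted_value(point, power=2, eps=1e-12):
    """Inverse distance weighting over a point's observation list (a sketch)."""
    x, y = point['coords']
    num = den = 0.0
    for (ox, oy, m) in point['observations']:  # stored as [X, Y, M]
        d = great_circle_km(x, y, ox, oy)
        if d < eps:  # the observation sits exactly on the grid point
            return m
        w = 1.0 / d ** power
        num += w * m
        den += w
    return num / den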
