
Optimize distance calculations between 2 **long**, 2-D arrays of points

I am trying to avoid looping by using the `apply` function to apply inline functions to all the rows of a DataFrame.

The thing is, I have ~800 points (truck stops), and I am trying to determine which of them lie along some route, which is itself defined by ~100k points.

My method is to compute the Euclidean distance from a truck stop to each point on the route; if any of these distances is less than some threshold, I retain the route.

I initially did this with explicit loops, but it was very slow (assuming I don't break out of the loop when a distance falls below the threshold, that is about 100k * 800 iterations).

So I tried using `apply`, but it is still slow. Does anyone know a way to optimize this?

FULL CODE:

import pandas as pd
import numpy as np
import time, os

BASE_DIR='C:\\Users\\aidenm\\Desktop\\geo'

rt_df = pd.read_csv(os.path.join(BASE_DIR, 'test_route.txt'))
'''
lon, lat
-118.410339, 34.019653
-118.410805, 34.020241
-118.411301, 34.020863
-118.411766, 34.021458
...
'''

fm_df = pd.read_csv(os.path.join(BASE_DIR, 'test_fm.txt'))
'''
lat, lon
41.033959, -77.515672
41.785524, -80.853175
41.128748, -80.769934
41.465085, -82.060677
...
'''



def is_on_route_inline(x, route_coordinates):
    '''
    :param x: one truck-stop row (lat, lon)
    :param route_coordinates: DataFrame of route points
    :return: True if the stop is within 0.1 of any route point, else False
    '''
    a = np.array((float(x[0]), float(x[1])))

    def distance_inline(b, fcm_point):
        return np.linalg.norm(b - fcm_point)

    # One distance per route point (still a Python-level loop under the hood)
    distances = route_coordinates.apply(distance_inline, args=(a,), axis=1)

    if min(distances) < 0.1:
        print(x)
        return True
    return False

fm_df.apply(is_on_route_inline, args=(rt_df,), axis=1)



To do this quickly you'll want to convert the data from the DataFrame into a Numpy array. To start, let's compute the distance between one truck stop and all route points:

# Create a Numpy array of shape (100k, 2)
route_points = rt_df[['lat', 'lon']].values

# Take one truck stop location, shape (2,)
truck_stop = fm_df[['lat', 'lon']].values[0]

# Compute all 100k distances in one vectorized call
dists = np.linalg.norm(route_points - truck_stop, axis=1)
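
As a self-contained sketch of the same idea (the coordinates below are made up purely for illustration, not taken from your files):

```python
import numpy as np

# Made-up coordinates: four route points and one truck stop,
# each as a (lat, lon) pair.
route_points = np.array([
    [34.0, -118.0],
    [34.1, -118.1],
    [34.2, -118.2],
    [34.3, -118.3],
])
truck_stop = np.array([34.1, -118.1])

# Broadcasting: (4, 2) - (2,) subtracts the stop from every row at once,
# then the norm is taken along each row (axis=1).
dists = np.linalg.norm(route_points - truck_stop, axis=1)
print(dists.min())  # 0.0 -- the stop coincides with the second route point
```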

This lets Numpy broadcasting handle the loop over all route points for you (very fast). However, it sounds like what you really need is the distance between all pairs of truck stops and route points. It's tricky to get Numpy broadcasting to do this directly, so I'd recommend using scipy.spatial.distance_matrix:

from scipy.spatial import distance_matrix

route_points = rt_df[['lat', 'lon']].values  # shape (100k, 2)
truck_points = fm_df[['lat', 'lon']].values  # shape (800, 2)

all_distances = distance_matrix(route_points, truck_points) # shape (100k, 800)

Now all_distances is a Numpy array containing all pairwise distances, so all_distances[i, j] is the distance between route point i and truck stop j. Again, this lets Numpy handle the 100k * 800 iterations for you and is very fast. (On my laptop this took ~3 seconds with similarly sized arrays.)
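
One caveat: a full (100k, 800) float64 matrix is about 640 MB. If that is too much memory, a sketch of an alternative (my own variation, not part of the answer above) is to loop over the short axis only, so each iteration allocates just one array of route-length distances:

```python
import numpy as np

def min_dist_per_stop(route_points, truck_points):
    """For each truck stop, the minimum Euclidean distance to any route point.

    Looping over the short axis (~800 stops) keeps peak memory at one
    (n_route,) array per iteration instead of the full
    (n_route, n_stops) distance matrix.
    """
    mins = np.empty(len(truck_points))
    for j, stop in enumerate(truck_points):
        mins[j] = np.linalg.norm(route_points - stop, axis=1).min()
    return mins

# Tiny made-up example: two route points, two stops.
route = np.array([[0.0, 0.0], [1.0, 0.0]])
stops = np.array([[0.0, 0.05], [5.0, 5.0]])
print(min_dist_per_stop(route, stops) < 0.1)  # [ True False]
```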

After that, you can find the distances that are small enough:

all_distances < 0.1
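
To turn that boolean matrix into one flag per truck stop, reduce along the route axis (axis 0, given the (100k, 800) shape). A tiny stand-in matrix shows the pattern:

```python
import numpy as np

# Stand-in for the (100k, 800) matrix: 3 route points x 2 truck stops.
all_distances = np.array([
    [0.05, 2.0],
    [1.50, 3.0],
    [0.90, 0.2],
])

# True for each truck stop that comes within 0.1 of any route point.
on_route = (all_distances < 0.1).any(axis=0)
print(on_route)  # [ True False]
```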
