Optimize distance calculations between 2 long, 2-D arrays of points

Question

I am trying to avoid looping by using the 'apply' function to apply inline functions on all the rows of a dataframe.

The thing is, I have ~800 Points (truck stops), and I am trying to determine which of these are along some route, which itself is defined by ~100k Points.

My method is to calculate the Euclidean distance from a truck stop to each point on the route, and if any of these distances is below some threshold, I keep that truck stop.

I initially did this with loops, but it was very slow (assuming I don't break out of the loop once a distance falls below the threshold, that is about 100k*800 iterations).

So I tried using 'apply' but it is still slow. Does anyone know a way I can optimize this?

FULL CODE:

import pandas as pd
import numpy as np
import time, os

BASE_DIR='C:\\Users\\aidenm\\Desktop\\geo'

rt_df = pd.read_csv(os.path.join(BASE_DIR, 'test_route.txt'))
'''
lon, lat
-118.410339, 34.019653
-118.410805, 34.020241
-118.411301, 34.020863
-118.411766, 34.021458
...
'''

fm_df = pd.read_csv(os.path.join(BASE_DIR, 'test_fm.txt'))
'''
lat, lon
41.033959, -77.515672
41.785524, -80.853175
41.128748, -80.769934
41.465085, -82.060677
...
'''



def is_on_route_inline(x, route_coordinates):
    '''
    :param x: one truck-stop row of fm_df, as (lat, lon)
    :param route_coordinates: DataFrame of route points
    :return: True if on route else False
    '''



    a = np.array((float(x[0]), float(x[1])))
    # bs = [np.array((c[1], c[0])) for c in rcs]


    def distance_inline(b, fcm_point):
        return np.linalg.norm(b-fcm_point)

    # bss = pd.Series(bs)
    distances = route_coordinates.apply(distance_inline, args=(a,), axis=1)   #np.linalg.norm(a-b))

    # distances = [np.linalg.norm(a-b) for b in bs]

    if min(distances)<0.1:
        print(x)
        return True

    return False

# Pass the route columns in (lat, lon) order so they line up with the fm_df rows
fm_df.apply(is_on_route_inline, args=(rt_df[['lat', 'lon']],), axis=1)

Answer 1

To do this quickly, you'll want to convert the data from the DataFrame into a NumPy array. To start, let's compute the distance between one truck stop and all route points:

# Create Numpy array of shape (100k, 2)
route_points = rt_df[['lat', 'lon']].values

# Get one truck stop location, shape (2,); here simply the first row of fm_df
truck_stop = fm_df[['lat', 'lon']].values[0]

# Compute distances
dists = np.linalg.norm(route_points - truck_stop, axis=1)

This lets NumPy broadcasting handle the loop over all route locations for you (very fast). However, it sounds like what you really need is the distance between all pairs of truck stops and route points. It's tricky to get NumPy broadcasting to do this, so I'd recommend using scipy.spatial.distance_matrix:

from scipy.spatial import distance_matrix

route_points = rt_df[['lat', 'lon']].values  # shape (100k, 2)
truck_points = fm_df[['lat', 'lon']].values  # shape (800, 2)

all_distances = distance_matrix(route_points, truck_points) # shape (100k, 800)

Now all_distances is a NumPy array containing all pairwise distances, so all_distances[i, j] is the distance between route point i and truck stop j. Again, this lets NumPy handle the 100k * 800 iterations for you and is very fast. (On my laptop it took ~3 seconds to complete this with similarly sized arrays.)
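As a quick sanity check on the indexing, here is a tiny sketch with made-up coordinates (the toy arrays are illustrative and not taken from the question's files). It compares distance_matrix against the equivalent explicit broadcasting expression, which gives the same (route point, truck stop) layout:

```python
import numpy as np
from scipy.spatial import distance_matrix

# Hypothetical toy data: 3 route points and 2 truck stops, one (lat, lon) pair per row
route_points = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
truck_points = np.array([[0.0, 0.0], [3.0, 0.0]])

# Rows index route points, columns index truck stops
all_distances = distance_matrix(route_points, truck_points)  # shape (3, 2)

# Same computation via broadcasting: (3, 1, 2) - (1, 2, 2) -> (3, 2, 2),
# then take the Euclidean norm over the last (coordinate) axis
diff = route_points[:, None, :] - truck_points[None, :, :]
by_broadcasting = np.linalg.norm(diff, axis=-1)  # shape (3, 2)
```

Note that the broadcasting version builds an intermediate array of shape (n_route, n_truck, 2), so it needs more memory at the question's sizes.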

After that, you can find the distances that are small enough. This gives a boolean array of the same shape:

all_distances < 0.1
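To get from that boolean matrix to what the question asks for (which truck stops lie near the route), one option is to reduce it along the route axis. A minimal sketch, assuming the (route point, truck stop) axis order used above and the question's 0.1 threshold; the coordinates are made up for illustration:

```python
import numpy as np
from scipy.spatial import distance_matrix

# Hypothetical toy data, one (lat, lon) pair per row
route_points = np.array([[34.0, -118.4], [34.1, -118.5], [34.2, -118.6]])
truck_points = np.array([[34.05, -118.45], [41.0, -77.5]])

all_distances = distance_matrix(route_points, truck_points)  # shape (3, 2)

# For each truck stop (column), True if at least one route point is closer than 0.1
near_route = (all_distances < 0.1).any(axis=0)  # shape (2,)

# Integer positions of the truck stops to keep
kept = np.flatnonzero(near_route)
```

With the question's DataFrames, the same mask could presumably be used as fm_df[near_route] to select the matching truck-stop rows, provided truck_points was built from fm_df in its original row order.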
