I am trying to avoid looping by using the 'apply' function to apply inline functions on all the rows of a dataframe.
The thing is, I have ~800 Points (truck stops), and I am trying to determine which of these are along some route, which itself is defined by ~100k Points.
My method is to compute the Euclidean distance from a truck stop to each point on the route, and if any of these distances is less than some threshold I retain the truck stop.
I initially did this by looping but it was super slow (assuming I don't break out of the loop when a distance is below the threshold, that is about 100k*800 iterations).
So I tried using 'apply' but it is still slow. Does anyone know a way I can optimize this?
FULL CODE:
import pandas as pd
import numpy as np
import time, os
BASE_DIR='C:\\Users\\aidenm\\Desktop\\geo'
rt_df = pd.read_csv(os.path.join(BASE_DIR, 'test_route.txt'))
'''
lon, lat
-118.410339, 34.019653
-118.410805, 34.020241
-118.411301, 34.020863
-118.411766, 34.021458
...
'''
fm_df = pd.read_csv(os.path.join(BASE_DIR, 'test_fm.txt'))
'''
lat, lon
41.033959, -77.515672
41.785524, -80.853175
41.128748, -80.769934
41.465085, -82.060677
...
'''
def is_on_route_inline(x, route_coordinates):
    '''
    :param route_coordinates:
    :param fencing_module_coordinate:
    :return: True if on route else False
    '''
    a = np.array((float(x[0]), float(x[1])))
    # bs = [np.array((c[1], c[0])) for c in rcs]

    def distance_inline(b, fcm_point):
        return np.linalg.norm(b-fcm_point)

    # bss = pd.Series(bs)
    distances = route_coordinates.apply(distance_inline, args=(a,), axis=1) #np.linalg.norm(a-b))
    # distances = [np.linalg.norm(a-b) for b in bs]
    if min(distances)<0.1:
        print(x)
        return True
    return False

fm_df.apply(is_on_route_inline, args=(rt_df,), axis=1)#rt_df)
To do this quickly you'll want to convert the data from the DataFrame into a NumPy array. To start, let's compute the distance between one truck stop and all route points:
# Create Numpy array of shape (100k, 2)
route_points = rt_df[['lat', 'lon']].values
truck_stop = fm_df[['lat', 'lon']].values[0]  # one truck stop location, shape (2,)
# Compute distances
dists = np.linalg.norm(route_points - truck_stop, axis=1)
This lets NumPy broadcasting handle looping over all route locations for you (very fast). However, it sounds like what you really need is the distance between all pairs of truck stops and route points. It's tricky to get NumPy broadcasting to do this, so I'd recommend using scipy.spatial.distance_matrix:
from scipy.spatial import distance_matrix
route_points = rt_df[['lat', 'lon']].values # shape (100k, 2)
truck_points = fm_df[['lat', 'lon']].values # shape (800, 2)
all_distances = distance_matrix(route_points, truck_points) # shape (100k, 800)
Now all_distances is a NumPy array containing all pairwise distances, so all_distances[i, j] is the distance between route point i and truck stop j. Again, this lets NumPy handle looping over the 100k * 800 iterations for you and is very fast. (On my laptop it took ~3 seconds to complete this with similarly sized arrays.)
After that, you can find the distances that are small enough:
all_distances < 0.1
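That comparison gives a boolean array with the same shape as all_distances. To get from there to "which truck stops are on the route", one option is to reduce it along the route axis with any(). The following is only a rough sketch: the tiny DataFrames are made-up stand-ins for rt_df and fm_df, and the names threshold, on_route and stops_on_route are my own, not from the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Made-up stand-ins for rt_df and fm_df (hypothetical values, for illustration only)
rt_df = pd.DataFrame({'lat': [34.0, 34.1, 34.2], 'lon': [-118.0, -118.1, -118.2]})
fm_df = pd.DataFrame({'lat': [34.05, 41.0], 'lon': [-118.05, -77.5]})

route_points = rt_df[['lat', 'lon']].values  # shape (n_route, 2)
truck_points = fm_df[['lat', 'lon']].values  # shape (n_stops, 2)
all_distances = distance_matrix(route_points, truck_points)  # shape (n_route, n_stops)

threshold = 0.1  # same cutoff as in the question, in degrees

# A truck stop counts as "on route" if at least one route point is within the threshold.
# axis=0 collapses the route-point axis, leaving one boolean per truck stop.
on_route = (all_distances < threshold).any(axis=0)  # shape (n_stops,)

# Use the boolean mask to keep only the matching truck stops
stops_on_route = fm_df[on_route]
print(stops_on_route)
```

Keep in mind that with the real data the full (100k, 800) float64 matrix needs roughly 640 MB, so if memory is tight you may want to process the truck stops in batches instead.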