Python - Distance matrix between geographic coordinates
I have a pandas DataFrame with over 600 geographic coordinate points. An extract follows below:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import sin, cos, sqrt, atan2, radians
lat_long = pd.DataFrame({'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
lat_long
To calculate the distance between two points manually, I use the code below:
lat1 = radians(lat_long['LATITUDE'][0])
lon1 = radians(lat_long['LONGITUDE'][0])
lat2 = radians(lat_long['LATITUDE'][1])
lon2 = radians(lat_long['LONGITUDE'][1])
R = 6373.0
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance = R * c
print("Result:", round(distance,4))
What I need is a function that uses the formula above to calculate the distance from every point to every other point, as a matrix. I am having trouble working out how to write that function and how to store the distances between the points. Any help is welcome. Example output (for illustrative purposes only, in case I have not been clear):
|        | point0 | point1 | point2 |
| point0 | 0      | 2      | 3      |
| point1 | 2      | 0      | 4      |
| point2 | 3      | 4      | 0      |
|distance|distance|distance|
You could use pdist to compute the pairwise distances:
import pandas as pd
import numpy as np
from math import sin, cos, sqrt, atan2, radians
from scipy.spatial.distance import pdist, squareform
lat_long = pd.DataFrame({'LATITUDE': [-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
def dist(x, y):
    """Function to compute the distance between two points x, y"""
    lat1 = radians(x[0])
    lon1 = radians(x[1])
    lat2 = radians(y[0])
    lon2 = radians(y[1])
    R = 6373.0
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return round(distance, 4)
distances = pdist(lat_long.values, metric=dist)
points = [f'point_{i}' for i in range(1, len(lat_long) + 1)]
result = pd.DataFrame(squareform(distances), columns=points, index=points)
print(result)
Output
point_1 point_2 point_3 point_4 point_5
point_1 0.0000 20.5115 8.4123 15.3203 50.1784
point_2 20.5115 0.0000 16.3400 15.8341 30.0319
point_3 8.4123 16.3400 0.0000 6.9086 44.1838
point_4 15.3203 15.8341 6.9086 0.0000 40.0284
point_5 50.1784 30.0319 44.1838 40.0284 0.0000
Notice that squareform converts the condensed distance vector returned by pdist (one value per pair of points) into a full square matrix, so the results are stored in a NumPy array.
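To make the role of squareform more concrete: pdist returns one distance per unordered pair of points, ordered (0,1), (0,2), ..., (1,2), and so on, and squareform lays those values out as a symmetric matrix with zeros on the diagonal. The sketch below reproduces that layout by hand with NumPy only, as an illustration rather than SciPy's actual implementation. The three values are the distances between the first three points in the output above; the names `condensed` and `square` are made up for this example.

```python
import numpy as np

# Condensed distances for three points, in pair order (0,1), (0,2), (1,2).
# Values copied from the result table above (illustrative only).
condensed = np.array([20.5115, 8.4123, 16.3400])
n = 3  # number of points; the condensed vector has n * (n - 1) / 2 entries

square = np.zeros((n, n))
rows, cols = np.triu_indices(n, k=1)  # index pairs with row < col, row by row
square[rows, cols] = condensed        # fill the upper triangle
square = square + square.T            # mirror it into the lower triangle

print(square)
```

For 600 points the condensed vector holds 600 * 599 / 2 = 179,700 distances, while the square matrix holds 360,000 entries.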
Another possible solution is:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import sin, cos, sqrt, atan2, radians
lat_long = pd.DataFrame({'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89], 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]})
lat_long
test = lat_long.iloc[2:,:]
def distance(city1, city2):
    lat1 = radians(city1['LATITUDE'])
    lon1 = radians(city1['LONGITUDE'])
    lat2 = radians(city2['LATITUDE'])
    lon2 = radians(city2['LONGITUDE'])
    R = 6373.0
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return distance
dist = np.zeros([lat_long.shape[0],lat_long.shape[0]])
for i1, city1 in lat_long.iterrows():
    # only pairs with i2 > i1 are visited, so just the upper triangle is filled
    for i2, city2 in lat_long.iloc[i1+1:,:].iterrows():
        dist[i1,i2] = distance(city1, city2)
print(dist)
Output
[[ 0. 20.51149047 8.41230771 15.32026132 50.17836849]
[ 0. 0. 16.33997119 15.83407186 30.03192954]
[ 0. 0. 0. 6.90864606 44.18376436]
[ 0. 0. 0. 0. 40.02842872]
[ 0. 0. 0. 0. 0. ]]
The lower triangle of the array is left as zeros because the distance matrix is symmetric (the distance from i1 to i2 equals the distance from i2 to i1), so only the upper triangle needs to be computed.
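If the full square matrix is needed, the missing half can be recovered by adding the transpose, since the diagonal is zero and each distance appears exactly once. The snippet below is a minimal sketch using the values for the first three points from the output above; `upper` and `full` are names chosen for this example.

```python
import numpy as np

# Upper-triangular distances for the first three points
# (values copied from the printed output above).
upper = np.array([
    [0.0, 20.51149047, 8.41230771],
    [0.0, 0.0, 16.33997119],
    [0.0, 0.0, 0.0],
])

# The diagonal is zero, so adding the transpose fills the lower
# triangle without double-counting anything.
full = upper + upper.T

print(full)
```

Applied to the loop result, the same step would be `dist + dist.T`.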
This seems to be about twice as fast:
# imports
import pandas as pd
import numpy as np
# supporting functions
def create_cartestin(df: pd.DataFrame):
    """
    This function returns the cartesian product of a dataframe with itself.
    df:
        dataframe to combine with itself
    """
    # work on a copy so the caller's dataframe does not gain helper columns
    df = df.copy()
    # create artificial id
    df['temp_id'] = list(range(len(df)))
    # create cartesian merging key
    df['temp_key'] = 1
    df_cartesian = df.merge(df, on=['temp_key']).drop(columns=['temp_key'])
    return df_cartesian
def hav(theta):
    return np.sin(theta/2)**2
def give_me_straight_line_distance(
    df: pd.DataFrame,
    lattitude_x: str,
    longitude_x: str,
    lattitude_y: str,
    longitude_y: str
):
    """
    This function calculates the distance between coordinates with the haversine formula.
    df:
        dataframe containing the cartesian product of the tables containing points of interest
    lattitude_x:
        name of column containing latitude of points x (1st set of points of interest)
    longitude_x:
        name of column containing longitude of points x (1st set of points of interest)
    lattitude_y:
        name of column containing latitude of points y (2nd set of points of interest)
    longitude_y:
        name of column containing longitude of points y (2nd set of points of interest)
    """
    # assumed Earth radius in km
    r = 6371.009
    coords = df[[lattitude_x,longitude_x,lattitude_y,longitude_y]].values
    coordinates = np.deg2rad(coords)
    lat1 = coordinates[:, 0]
    lng1 = coordinates[:, 1]
    lat2 = coordinates[:, 2]
    lng2 = coordinates[:, 3]
    coslat1 = np.cos(lat1)
    coslat2 = np.cos(lat2)
    t = hav(lat2-lat1) + coslat1[:]*coslat2[:]*hav(lng2-lng1)
    d = 2*r*np.arcsin(np.sqrt(t))
    return d
# THE FUNCTION
def give_me_distance_matrix(df: pd.DataFrame):
    # create cartesian
    df_cartesian = create_cartestin(df)
    # calc distance for each pair of points
    df_cartesian['distance_km'] = give_me_straight_line_distance(
        df = df_cartesian,
        lattitude_x = 'LATITUDE_x',
        longitude_x = 'LONGITUDE_x',
        lattitude_y = 'LATITUDE_y',
        longitude_y = 'LONGITUDE_y'
    )
    # turn into matrix format
    df_cartesian = df_cartesian.set_index(['temp_id_x','temp_id_y'])[['distance_km']].unstack(['temp_id_y'])
    # erasing artificial names
    df_cartesian = df_cartesian.reset_index(drop = True)
    df_cartesian = df_cartesian.T.reset_index(drop = True)
    return df_cartesian
lat_long = pd.DataFrame({'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89]*100, 'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67]*100})
lat_long.shape
(500, 2)
%%timeit
result = give_me_distance_matrix(lat_long)
244 ms ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
than the pdist approach with a custom Python metric:
import pandas as pd
import numpy as np
from math import sin, cos, sqrt, atan2, radians
from scipy.spatial.distance import pdist, squareform
def dist(x, y):
    """Function to compute the distance between two points x, y"""
    lat1 = radians(x[0])
    lon1 = radians(x[1])
    lat2 = radians(y[0])
    lon2 = radians(y[1])
    R = 6373.0
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = R * c
    return round(distance, 4)
%%timeit
distances = pdist(lat_long.values, metric=dist)
points = [f'point_{i}' for i in range(1, len(lat_long) + 1)]
result = pd.DataFrame(squareform(distances), columns=points, index=points)
563 ms ± 77.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
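Note that the two timed versions assume slightly different Earth radii (6371.009 km versus 6373.0 km), so their distances differ by roughly 0.03%. As a further variation on the same idea, the haversine formula can also be evaluated for all pairs at once with NumPy broadcasting, which avoids both the per-pair Python calls and the cartesian merge. The following is a minimal sketch that has not been benchmarked here; the function name `haversine_matrix` and its `radius_km` parameter are hypothetical, and the radius is set to 6373.0 only to match the question.

```python
import numpy as np

def haversine_matrix(lat_deg, lon_deg, radius_km=6373.0):
    """Pairwise haversine distances (km) between all points, as an (n, n) array."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row yields every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Guard against rounding pushing a marginally outside [0, 1].
    a = np.clip(a, 0.0, 1.0)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Sample coordinates from the question.
lat = [-22.98, -22.97, -22.92, -22.87, -22.89]
lon = [-43.19, -43.39, -43.24, -43.28, -43.67]

matrix = haversine_matrix(lat, lon)
print(np.round(matrix, 4))
```

This builds a few intermediate n-by-n arrays, which is unproblematic for several hundred points but grows quadratically with the number of points. The result can be wrapped in a DataFrame with `pd.DataFrame(matrix, index=points, columns=points)` if labels are wanted.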