How can I speed up geographic distance calculations in Pandas?

Question

I am a Pandas/Python beginner and I don't know how to speed up my code. I have an unsorted Pandas DataFrame called test with about 10,000 rows and multiple columns, including latitude and longitude. For each row, I want to know how many other rows are nearby (within a threshold distance, e.g. 10 km).

I tried this:

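import numpy as np

# Named haversine because the loop below calls haversine(...) and
# reuses the name "distance" for the threshold.
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    # Inputs are in degrees; earth_radius is in km, so the result is in km.
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))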

and:

test["Number_of_neighbours"] = 0
distance = 5  # threshold in km

for i in range(len(test)):
    for y in range(len(test)):
        x = haversine(lat1=test['latitude'].loc[i], lon1=test['longitude'].loc[i],
                      lat2=test['latitude'].loc[y], lon2=test['longitude'].loc[y])

        # Count rows within the threshold, skipping zero distance (the row itself).
        if x <= distance and x != 0:
            test.at[i, 'Number_of_neighbours'] = 1 + test.loc[i, 'Number_of_neighbours']

But Jupyter takes forever to compute the result. Do you have any suggestions or a more performant solution in mind? Thank you very much in advance!

Answer 1

Since your data isn't too big, you can rewrite your distance function to take advantage of NumPy's broadcasting:

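def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    # [:, None] turns the first set of points into a column, so each term
    # broadcasts to a matrix: one row per first point, one column per second point.
    a = np.sin((lat2-lat1[:,None])/2.0)**2 + \
        np.cos(lat2) * np.cos(lat1[:,None]) * np.sin((lon2-lon1[:,None])/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))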

distance = 5  # threshold in km
dist = haversine(test['latitude'],test['longitude'], test['latitude'], test['longitude'])

# Each row is 0 km from itself, so this count includes the row itself.
test['neighbors'] = (dist < distance).sum(-1)

All of that took about 7 s on my system for 10,000 rows.
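One thing to keep in mind with the broadcasting approach is memory: a full 10,000 x 10,000 matrix of float64 distances needs about 800 MB. As a rough sketch (not part of the original answer), the same computation can be done in row chunks so that only a slice of the matrix exists at a time. The function name count_neighbours and the chunk_size parameter are made up for illustration, and the row itself is excluded by index here rather than by the x != 0 test from the question, so rows with identical coordinates still count as neighbours of each other.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def count_neighbours(lat_deg, lon_deg, max_km, chunk_size=1000):
    """Count, for each point, how many other points lie within max_km.

    Hypothetical helper: processes chunk_size rows at a time so the
    temporary distance matrix has shape (chunk_size, n) instead of (n, n).
    """
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    cos_lat = np.cos(lat)
    counts = np.zeros(lat.size, dtype=np.int64)

    for start in range(0, lat.size, chunk_size):
        stop = min(start + chunk_size, lat.size)

        # Column vector (chunk, 1) against row vector (1, n) broadcasts
        # to a (chunk, n) matrix of pairwise differences.
        dlat = lat[None, :] - lat[start:stop, None]
        dlon = lon[None, :] - lon[start:stop, None]

        a = (np.sin(dlat / 2.0) ** 2
             + cos_lat[start:stop, None] * cos_lat[None, :]
             * np.sin(dlon / 2.0) ** 2)
        # Clip guards against rounding pushing a slightly outside [0, 1].
        dist = EARTH_RADIUS_KM * 2.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

        within = dist <= max_km
        # Exclude each point's comparison with itself (by index, not by value).
        within[np.arange(stop - start), np.arange(start, stop)] = False
        counts[start:stop] = within.sum(axis=1)

    return counts
```

With the DataFrame from the question, this could be called as test['Number_of_neighbours'] = count_neighbours(test['latitude'], test['longitude'], max_km=5). With chunk_size=1000 and 10,000 rows, each temporary matrix is about 80 MB.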

Solution 1
0 2020-10-15 19:43:00
