
How can I speed up my Pandas geo distance calculation?

I am a Pandas / Python beginner and I don't know how to speed up my code. I have an unsorted Pandas DataFrame called test with about 10,000 rows and multiple columns, including latitude and longitude. For each row, I want to know how many other rows are close by (within a threshold distance, e.g. 10 km).

I tried this:

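import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    # Convert degrees to radians unless the inputs are already in radians
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    # Great-circle distance in the units of earth_radius (km by default)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))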

and:

test["Number_of_neighbours"] = 0
distance = 5  # threshold in km
for i in range(len(test)):
    for y in range(len(test)):
        x = haversine(lat1=test['latitude'].loc[i], lon1=test['longitude'].loc[i],
                      lat2=test['latitude'].loc[y], lon2=test['longitude'].loc[y])
        # Skip the row itself (distance 0) and count rows within the threshold
        if x <= distance and x != 0:
            test.at[i, 'Number_of_neighbours'] = 1 + test.loc[i, 'Number_of_neighbours']


But Jupyter takes forever to compute the result. Do you have any suggestions or a more performant solution in mind? Thank you very much in advance!

Since your data isn't too big, you can rewrite your distance function to take advantage of NumPy's broadcasting:

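import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    # Inputs are in degrees unless to_radians=False
    lat1, lon1, lat2, lon2 = map(np.asarray, (lat1, lon1, lat2, lon2))
    if to_radians:
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))

    # [:, None] turns the first set of points into a column, so each term
    # broadcasts to a (len(lat1), len(lat2)) matrix of pairwise values
    a = np.sin((lat2 - lat1[:, None]) / 2.0)**2 + \
        np.cos(lat1[:, None]) * np.cos(lat2) * np.sin((lon2 - lon1[:, None]) / 2.0)**2

    # Great-circle distance in the units of earth_radius (km by default)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

distance = 5  # threshold in km
dist = haversine(test['latitude'], test['longitude'], test['latitude'], test['longitude'])

# Each row is 0 km from itself, so subtract 1 to count only the other rows
test['neighbors'] = (dist <= distance).sum(-1) - 1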

All that took about 7 s on my system for 10,000 rows.
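
To see what the `[:, None]` indexing in the function does, here is a small sketch on three made-up latitudes (the values are purely illustrative and not taken from the question's data). It only shows how a column-shaped array combined with a flat array yields a matrix of pairwise differences:

```python
import numpy as np

# Made-up latitudes in degrees, converted to radians
lat = np.radians(np.array([52.50, 52.52, 48.14]))

# lat[:, None] has shape (3, 1) and lat has shape (3,), so the
# subtraction broadcasts to a (3, 3) matrix of pairwise differences.
diff = lat - lat[:, None]

# Entry [i, j] holds lat[j] - lat[i], so the diagonal is all zeros.
print(diff.shape)
print(np.diag(diff))
```

The same pattern applied to every term of the haversine formula gives the full distance matrix in one vectorized step, with no Python-level loop.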

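The full distance matrix for n rows holds n² values (about 800 MB of float64 for 10,000 rows), so for much larger frames memory may become the limit before speed does. One option is to apply the same broadcasting to blocks of rows. The following is a minimal sketch under that assumption; `count_neighbors`, `max_km` and `chunk_size` are hypothetical names introduced here, and the coordinates are made up:

```python
import numpy as np

def count_neighbors(lat, lon, max_km, earth_radius=6371, chunk_size=1000):
    """Count, for each point, the other points within max_km (haversine)."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    counts = np.empty(len(lat), dtype=int)
    for start in range(0, len(lat), chunk_size):
        stop = start + chunk_size
        # A (chunk, 1) column against an (n,) array broadcasts to (chunk, n)
        dlat = lat - lat[start:stop, None]
        dlon = lon - lon[start:stop, None]
        a = (np.sin(dlat / 2.0) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat) * np.sin(dlon / 2.0) ** 2)
        dist = earth_radius * 2 * np.arcsin(np.sqrt(a))
        # Subtract 1 because every point is 0 km from itself
        counts[start:stop] = (dist <= max_km).sum(axis=1) - 1
    return counts

# Made-up sample: two points a few km apart, one several hundred km away
sample_lat = [52.50, 52.52, 48.14]
sample_lon = [13.40, 13.42, 11.58]
neighbors = count_neighbors(sample_lat, sample_lon, max_km=5)
print(neighbors)
```

With the DataFrame from the question, this would be called as `test['neighbors'] = count_neighbors(test['latitude'], test['longitude'], max_km=5)`. A smaller `chunk_size` lowers peak memory at the cost of more loop iterations.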

 