How do I speed up geopy calculations for a pandas DataFrame with approx. 100k entries?

I have a pandas DataFrame called "orders" with approx. 100k entries containing address data (zip, city, country). For each entry, I would like to calculate the distance to a specific predefined address.

So far, I'm looping over the DataFrame rows with a for-loop and using geopy to (1) get latitude and longitude values for each entry and (2) calculate the distance to my predefined address.

Although this works, it takes an awful lot of time (over 15 hours at an average of 2 iterations/second), and I assume I haven't found the most efficient way yet. I did quite a lot of research and tried out different approaches such as vectorization, but these alternatives did not seem to speed up the process (maybe because I didn't implement them correctly, as I'm not a very experienced Python user).

This is my code so far:

import geopy
import geopy.distance
from geopy.geocoders import Nominatim
from tqdm import tqdm


def get_geographic_information(destination_geocode, starting_point_coordinates):
    # Coordinates of the geocoded destination and the geodesic distance (km)
    # to the predefined starting point.
    latitude = destination_geocode.latitude
    longitude = destination_geocode.longitude
    destination_coordinates = (latitude, longitude)
    distance = round(geopy.distance.distance(starting_point_coordinates, destination_coordinates).km, 2)
    return latitude, longitude, distance


orders["Latitude"] = ""
orders["Longitude"] = ""
orders["Distance"] = ""

geolocator = Nominatim(user_agent="Project01")

starting_point = "my_address"
starting_point_geocode = geolocator.geocode(starting_point, timeout=10000)
starting_point_coordinates = (starting_point_geocode.latitude, starting_point_geocode.longitude)

for index in tqdm(range(len(orders))):
    destination_zip = orders.loc[index, "ZIP"]
    destination_city = orders.loc[index, "City"]
    destination_country = orders.loc[index, "Country"]

    destination = destination_zip + " " + destination_city + " " + destination_country
    destination_geocode = geolocator.geocode(destination, timeout=15000)

    if destination_geocode is not None:
        latitude, longitude, distance = get_geographic_information(destination_geocode, starting_point_coordinates)

        orders.loc[index, "Latitude"] = latitude
        orders.loc[index, "Longitude"] = longitude
        orders.loc[index, "Distance"] = distance
    else:
        orders.loc[index, "Latitude"] = "-"
        orders.loc[index, "Longitude"] = "-"
        orders.loc[index, "Distance"] = "-"

From my previous research, I learned that the for-loop might be the problem, but I haven't managed to replace it yet. As this is my first question here, I'd appreciate any constructive feedback. Thanks in advance!

The speed of your script is likely limited by using Nominatim. They throttle the speed to 1 request per second, as per this link:

https://operations.osmfoundation.org/policies/nominatim/
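
For reference, geopy ships a RateLimiter helper that enforces that kind of per-request delay (and retries on timeouts) instead of relying on the loop's own pacing; it won't make the script faster, but it makes the limit explicit in code. A minimal sketch, reusing the user agent from the question (the sample address is made up):

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Project01")

# Wrap geocode so consecutive calls are at least one second apart,
# as required by the Nominatim usage policy linked above.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1, max_retries=2)

location = geocode("10115 Berlin Germany")  # hypothetical example address
if location is not None:
    print(location.latitude, location.longitude)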

The only way to really speed this script up would be to find a different service that allows bulk requests. Geopy has a list of geocoding services that it currently supports. Your best bet would be to look through that list and see if you find a service that handles bulk requests (e.g. Google V3); that would let you either make requests in batches or use a distributed process to speed things up.
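
Independent of which geocoder you end up with, one thing that can cut the number of requests dramatically is to geocode each unique ZIP/City/Country combination only once and then map the results back onto the full DataFrame. A sketch under the assumptions that the column names are those from the question and that the address columns contain non-null strings; the locate helper and the lookup table are made up for illustration:

import geopy.distance
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Project01")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

starting_point_geocode = geolocator.geocode("my_address", timeout=10000)
starting_point_coordinates = (starting_point_geocode.latitude,
                              starting_point_geocode.longitude)

# Build the query strings once and keep only the unique ones.
addresses = orders["ZIP"] + " " + orders["City"] + " " + orders["Country"]
unique_addresses = addresses.drop_duplicates()

def locate(address):
    # Geocode one address and compute the distance to the starting point.
    location = geocode(address)
    if location is None:
        return pd.Series({"Latitude": "-", "Longitude": "-", "Distance": "-"})
    distance = round(geopy.distance.distance(
        starting_point_coordinates,
        (location.latitude, location.longitude)).km, 2)
    return pd.Series({"Latitude": location.latitude,
                      "Longitude": location.longitude,
                      "Distance": distance})

# One request per unique address instead of one per order row.
lookup = unique_addresses.apply(locate)
lookup.index = unique_addresses.values

# Map the geocoded results back onto the full orders DataFrame.
orders["Latitude"] = addresses.map(lookup["Latitude"])
orders["Longitude"] = addresses.map(lookup["Longitude"])
orders["Distance"] = addresses.map(lookup["Distance"])

If many orders share the same postal code, this alone can shrink a 100k-row job to a few thousand requests; switching to a bulk-capable service as suggested above then speeds up whatever remains.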
