How do I speed up geopy calculations for a pandas DataFrame with approx. 100k entries?

I have a pandas DataFrame called "orders" with approx. 100k entries containing address data (zip, city, country). For each entry, I would like to calculate the distance to a specific predefined address.

So far, I'm looping over the DataFrame rows with a for-loop and using geopy to (1) get latitude and longitude values for each entry and (2) calculate the distance to my predefined address.

Although this works, it takes an awful lot of time (over 15 hours at an average of 2 iterations/second), and I assume I haven't found the most efficient way yet. I did quite a lot of research and tried out different approaches such as vectorization, but these alternatives did not seem to speed up the process (maybe because I didn't implement them correctly, as I'm not a very experienced Python user).

This is my code so far:

import geopy
import geopy.distance
from geopy.geocoders import Nominatim
from tqdm import tqdm


def get_geographic_information(destination_geocode, starting_point_coordinates):
    # Coordinates of the geocoded destination and the geodesic distance (km)
    # to the predefined starting point.
    latitude = destination_geocode.latitude
    longitude = destination_geocode.longitude
    destination_coordinates = (latitude, longitude)
    distance = round(geopy.distance.distance(starting_point_coordinates, destination_coordinates).km, 2)
    return latitude, longitude, distance


orders["Latitude"] = ""
orders["Longitude"] = ""
orders["Distance"] = ""

geolocator = Nominatim(user_agent="Project01")

starting_point = "my_address"
starting_point_geocode = geolocator.geocode(starting_point, timeout=10000)
starting_point_coordinates = (starting_point_geocode.latitude, starting_point_geocode.longitude)

for index in tqdm(range(len(orders))):
    destination_zip = orders.loc[index, "ZIP"]
    destination_city = orders.loc[index, "City"]
    destination_country = orders.loc[index, "Country"]

    destination = destination_zip + " " + destination_city + " " + destination_country
    destination_geocode = geolocator.geocode(destination, timeout=15000)

    if destination_geocode is not None:
        latitude, longitude, distance = get_geographic_information(destination_geocode, starting_point_coordinates)

        orders.loc[index, "Latitude"] = latitude
        orders.loc[index, "Longitude"] = longitude
        orders.loc[index, "Distance"] = distance
    else:
        orders.loc[index, "Latitude"] = "-"
        orders.loc[index, "Longitude"] = "-"
        orders.loc[index, "Distance"] = "-"

From my previous research, I learned that the for-loop might be the problem, but I haven't managed to replace it yet. As this is my first question here, I'd appreciate any constructive feedback. Thanks in advance!

The speed of your script is likely limited by using Nominatim. They throttle the speed to 1 request per second, as per this link:

https://operations.osmfoundation.org/policies/nominatim/
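
For reference, geopy ships a RateLimiter helper that enforces that kind of per-request delay (and retries on timeouts) instead of relying on the loop's own pacing; it won't make the script faster, but it makes the limit explicit in code. A minimal sketch, reusing the user agent from the question (the sample address is made up):

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Project01")

# Wrap geocode so consecutive calls are at least one second apart,
# as required by the Nominatim usage policy linked above.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1, max_retries=2)

location = geocode("10115 Berlin Germany")  # hypothetical example address
if location is not None:
    print(location.latitude, location.longitude)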

The only way to really speed this script up would be to find a different service that allows bulk requests. Geopy has a list of geocoding services that it currently supports. Your best bet would be to look through that list and see if you find a service that handles bulk requests (e.g. Google V3); that would let you either make requests in batches or use a distributed process to speed things up.
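
Independent of which geocoder you end up with, one thing that can cut the number of requests dramatically is to geocode each unique ZIP/City/Country combination only once and then map the results back onto the full DataFrame. A sketch under the assumptions that the column names are those from the question and that the address columns contain non-null strings; the locate helper and the lookup table are made up for illustration:

import geopy.distance
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Project01")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

starting_point_geocode = geolocator.geocode("my_address", timeout=10000)
starting_point_coordinates = (starting_point_geocode.latitude,
                              starting_point_geocode.longitude)

# Build the query strings once and keep only the unique ones.
addresses = orders["ZIP"] + " " + orders["City"] + " " + orders["Country"]
unique_addresses = addresses.drop_duplicates()

def locate(address):
    # Geocode one address and compute the distance to the starting point.
    location = geocode(address)
    if location is None:
        return pd.Series({"Latitude": "-", "Longitude": "-", "Distance": "-"})
    distance = round(geopy.distance.distance(
        starting_point_coordinates,
        (location.latitude, location.longitude)).km, 2)
    return pd.Series({"Latitude": location.latitude,
                      "Longitude": location.longitude,
                      "Distance": distance})

# One request per unique address instead of one per order row.
lookup = unique_addresses.apply(locate)
lookup.index = unique_addresses.values

# Map the geocoded results back onto the full orders DataFrame.
orders["Latitude"] = addresses.map(lookup["Latitude"])
orders["Longitude"] = addresses.map(lookup["Longitude"])
orders["Distance"] = addresses.map(lookup["Distance"])

If many orders share the same postal code, this alone can shrink a 100k-row job to a few thousand requests; switching to a bulk-capable service as suggested above then speeds up whatever remains.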
