
Python Geopy Nominatim too many requests

The following script works perfectly with a file containing 2 rows, but when I tried a 2,500-row file I got 429 (Too Many Requests) exceptions. So I increased the delay between queries to 5 seconds. I also filled in the user agent. After these unsuccessful attempts, I connected to a VPN to get a fresh IP, but I got 429 errors again. Is there something I am missing here? The Nominatim policy specifies no more than 1 connection per second, and I am doing one per 5 seconds... any help would be appreciated!

from geopy.geocoders import Nominatim
import pandas
from functools import partial

from geopy.extra.rate_limiter import RateLimiter

nom = Nominatim(user_agent="xxx@gmail.com")
geocode = RateLimiter(nom.geocode, min_delay_seconds=5)  # rate-limited wrapper


df = pandas.read_csv('Book1.csv', engine='python')
# note: this applies nom.geocode directly, not the RateLimiter wrapper above
df["ALL"] = df['Address'].apply(partial(nom.geocode, timeout=1000, language='en'))
df["Latitude"] = df["ALL"].apply(lambda x: x.latitude if x is not None else None)
df["Longitude"] = df["ALL"].apply(lambda x: x.longitude if x is not None else None)

writer = pandas.ExcelWriter('Book1.xlsx')
df.to_excel(writer, 'new_sheet')
writer.save()
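One thing worth noting about the snippet above: the rate-limited geocode wrapper is created but never used, because the apply call passes nom.geocode directly, so the 5-second delay never takes effect. A minimal fix, keeping the rest of the script unchanged, would be:

# apply the RateLimiter wrapper instead of the raw geocoder
df["ALL"] = df['Address'].apply(partial(geocode, timeout=1000, language='en'))

RateLimiter forwards extra keyword arguments to the wrapped function, so timeout and language still pass through.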

I've done reverse geocoding of ~10K different lat-lon combinations in less than a day. Nominatim doesn't like bulk queries, so the idea is to avoid looking like one. Here's what I suggest:

  1. Make sure that you only query unique items. I've found that repeated queries for the same lat-lon combination are blocked by Nominatim. The same can be true for addresses. You can use unq_address = df['address'].unique() and then make the queries from that series. You may even end up with far fewer addresses to look up (see the combined sketch after the code below).

  2. The time between queries should be random. I also set the user_agent to a new random value on every run. In my case, I use the following code:

from time import sleep
from random import randint
import logging

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

user_agent = 'user_me_{}'.format(randint(10000, 99999))
geolocator = Nominatim(user_agent=user_agent)

def reverse_geocode(geolocator, latlon, sleep_sec):
    try:
        return geolocator.reverse(latlon)
    except GeocoderTimedOut:
        logging.info('TIMED OUT: GeocoderTimedOut: Retrying...')
        sleep(randint(1*100, sleep_sec*100)/100)  # random back-off before retrying
        return reverse_geocode(geolocator, latlon, sleep_sec)
    except GeocoderServiceError as e:
        logging.info('CONNECTION REFUSED: GeocoderServiceError encountered.')
        logging.error(e)
        return None
    except Exception as e:
        logging.info('ERROR: Terminating due to exception {}'.format(e))
        return None

I find that the line sleep(randint(1*100, sleep_sec*100)/100) does the trick for me.
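To tie the two suggestions together, here is a minimal sketch, assuming a DataFrame df with an 'address' column and the geolocator defined above, that queries each unique address once with a random pause in between and maps the results back:

# Hypothetical sketch combining steps 1 and 2 for forward geocoding
results = {}
for addr in df['address'].unique():          # step 1: only query unique items
    results[addr] = geolocator.geocode(addr)
    sleep(randint(1*100, 5*100)/100)         # step 2: random 1-5 s pause
df['location'] = df['address'].map(results)  # map results back to every row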

After some research, it turns out Nominatim has a 1,000-query-per-day limit, so the script was trying to do more than it allows.

https://getlon.lat/
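Under that reading, a minimal sketch of a daily cap, reusing the df and the rate-limited geocode wrapper from the question and an assumed limit of 1,000 queries per run, would be to geocode only the not-yet-resolved rows and save partial results between runs:

# Hypothetical daily-cap sketch: geocode at most 1,000 unresolved rows per run
DAILY_LIMIT = 1000
if "ALL" not in df.columns:
    df["ALL"] = None
todo = df[df["ALL"].isna()].head(DAILY_LIMIT)
df.loc[todo.index, "ALL"] = todo["Address"].apply(geocode)
df.to_pickle("Book1_partial.pkl")  # resume from this file on the next run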

Regarding geocoding with the geopy RateLimiter and Nominatim, I have put together the following function, which works well. It breaks a large file (in this case a pandas DataFrame) into batches. There is also a try/except clause to catch errors; on failure it returns the partial dataset and reports the final row number of the last batch executed, which you can then reuse in the last function parameter, unique_array_pos. This is useful for keeping partial results and resuming from where it stopped.

Parameters:

  • wait_time_batch: wait time between batches
  • wait_time_retries: wait time between retries
  • data_df: pandas DataFrame with one column containing the full address
  • batch_size: the size of each batch
  • address_column: the column in the df where the address is stored
  • unique_array_pos: in case the function errors out, the point from which to resume the geocoding

Through trial and error I found that a good batch size is 200; even for large datasets it ensures the geocoding does not error out.

I am not using tqdm for two reasons: I could not get it to work in JupyterLab, and I find the row-by-row output just as effective.

Enjoy!

import math
import time
from time import sleep
from random import randint

import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

def batch_geocode(wait_time_batch, wait_time_retries, data_df, batch_size, address_column, unique_array_pos):
    unique = data_df[address_column].unique()  # get unique addresses from dataframe
    un_size = len(unique)
    n_iter = math.ceil(un_size / batch_size)  # compute the number of iterations necessary
    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    print('size: ' + str(un_size))
    print('n_iter: ' + str(n_iter))
    final = np.empty((0, 2), dtype=object)

    for i in range(unique_array_pos, n_iter, 1):
        try:
            start_iter = time.perf_counter()
            start = i * batch_size  # first row of this batch
            print('batch:' + str(i) + ',row number:' + str(start))
            # fresh geocoder with a random user agent for every batch
            geolocator = Nominatim(user_agent='trial' + str(randint(0, 1000)))
            geocode = RateLimiter(geolocator.geocode, max_retries=3,
                                  error_wait_seconds=randint(1*100, wait_time_retries*100)/100)
            temp1 = unique[start:(i + 1) * batch_size]
            loc = np.array([geocode(x) for x in temp1])  # geocode the batch
            temp2 = np.c_[temp1, loc]  # pair each address with its location
            final = np.append(final, temp2, axis=0)
            sleep(randint(1*100, wait_time_batch*100)/100)  # random pause between batches
            print(f'iteration time: {time.perf_counter() - start_iter: .2f}'
                  + f', total time: {time.perf_counter() - start_time:.2f}')
        except Exception as e:
            print(e)
            print('failed execution, last position in unique array: ' + str(unique_array_pos + i * batch_size))
            return pd.DataFrame(final, columns=['address', 'location'])
    return pd.DataFrame(final, columns=['address', 'location'])
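As a usage sketch (the DataFrame name, the 'Address' column, and the argument values here are illustrative, using the 200-row batch size suggested above):

# Hypothetical call: df holds the full addresses in a column named 'Address'
geocoded = batch_geocode(
    wait_time_batch=5,        # random pause of up to 5 s between batches
    wait_time_retries=2,      # random retry back-off of up to 2 s
    data_df=df,
    batch_size=200,           # batch size that proved reliable above
    address_column='Address',
    unique_array_pos=0,       # start at the beginning; on failure, resume from the reported position
)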
