
Python Geopy Nominatim too many requests

["

The following script works perfectly with a file containing 2 rows but when I tried 2500 row file, I got 429 exceptions.<\/i>

from geopy.geocoders import Nominatim
import pandas
from functools import partial

from geopy.extra.rate_limiter import RateLimiter

nom = Nominatim(user_agent="xxx@gmail.com")
geocode = RateLimiter(nom.geocode, min_delay_seconds=5)


df = pandas.read_csv('Book1.csv', engine='python')
df["ALL"] = df['Address'].apply(partial(nom.geocode, timeout=1000, language='en'))
df["Latitude"] = df["ALL"].apply(lambda x: x.latitude if x is not None else None)
df["Longitude"] = df["ALL"].apply(lambda x: x.longitude if x is not None else None)

writer = pandas.ExcelWriter('Book1.xlsx')
df.to_excel(writer, 'new_sheet')
writer.save()
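
Note that the script above builds a rate-limited wrapper (geocode = RateLimiter(nom.geocode, min_delay_seconds=5)) but then applies nom.geocode directly, so the 5-second delay is never enforced. A minimal sketch of routing the calls through the wrapper instead (same file and column names as above):

from functools import partial

import pandas
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

nom = Nominatim(user_agent="xxx@gmail.com")
# Wrap the geocoder once and reuse the wrapper so every call is throttled.
geocode = RateLimiter(nom.geocode, min_delay_seconds=5)

df = pandas.read_csv('Book1.csv', engine='python')
# Apply the rate-limited wrapper, not nom.geocode, so the delay applies per call.
df["ALL"] = df['Address'].apply(partial(geocode, timeout=1000, language='en'))
df["Latitude"] = df["ALL"].apply(lambda x: x.latitude if x is not None else None)
df["Longitude"] = df["ALL"].apply(lambda x: x.longitude if x is not None else None)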

I've done reverse geocoding of ~10K different lat-lon combinations in less than a day. Nominatim doesn't like bulk queries, so the idea is to avoid looking like one. Here's what I suggest:

  1. Make sure that you only query unique items. I've found that repeated queries for the same lat-lon combination are blocked by Nominatim, and the same can be true for addresses. You can use unq_address = df['address'].unique() and then query using that series; you may even end up with far fewer addresses (see the sketch after this answer).

  2. The time between queries should be random. I also set the user_agent to a fresh random value on each run. In my case, I use the following code:

     import logging
     from time import sleep
     from random import randint

     from geopy.geocoders import Nominatim
     from geopy.exc import GeocoderTimedOut, GeocoderServiceError

     user_agent = 'user_me_{}'.format(randint(10000, 99999))
     geolocator = Nominatim(user_agent=user_agent)

     def reverse_geocode(geolocator, latlon, sleep_sec):
         try:
             return geolocator.reverse(latlon)
         except GeocoderTimedOut:
             logging.info('TIMED OUT: GeocoderTimedOut: Retrying...')
             sleep(randint(1 * 100, sleep_sec * 100) / 100)  # random back-off, then retry
             return reverse_geocode(geolocator, latlon, sleep_sec)
         except GeocoderServiceError as e:
             logging.info('CONNECTION REFUSED: GeocoderServiceError encountered.')
             logging.error(e)
             return None
         except Exception as e:
             logging.info('ERROR: Terminating due to exception {}'.format(e))
             return None

I find that the line sleep(randint(1*100,sleep_sec*100)/100) does the trick for me: it sleeps for a random duration between 1 and sleep_sec seconds, with centisecond granularity, so the retries don't arrive at a fixed interval.
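
As a sketch of point 1 combined with the function above (the input file and column names here are illustrative assumptions), query each unique lat-lon combination exactly once and map the cached results back onto the full dataframe:

import pandas as pd

df = pd.read_csv('coords.csv')  # hypothetical input with 'lat' and 'lon' columns
df['latlon'] = list(zip(df['lat'], df['lon']))  # geopy's reverse() accepts (lat, lon) tuples
# Query each distinct combination exactly once...
unique_latlon = df['latlon'].unique()
locations = {ll: reverse_geocode(geolocator, ll, sleep_sec=5) for ll in unique_latlon}
# ...then broadcast the cached results back to every row.
df['location'] = df['latlon'].map(locations)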

After some research, it turns out Nominatim has a limit of 1,000 queries per day, so the script was trying to do more than that.

https://getlon.lat/

Regarding geocoding with geopy's RateLimiter and Nominatim, I have put together the following function, which works well. It breaks a large input (in this case a pandas DataFrame) down into batches. There is also a try/except clause to catch errors: it returns the partial dataset and reports the index of the last batch executed, which you then reuse as the last function parameter, unique_array_pos. This is useful for keeping partial results and resuming from where it stopped.

Parameters:

  • wait_time_batch: wait time between batches
  • wait_time_retries: wait time between retries
  • data_df: pandas DataFrame with one column containing the full address
  • batch_size: the size of each batch
  • address_column: the column in the df where the address is stored
  • unique_array_pos: in case the function errors out, the point from which to resume the geocoding.

Through trial and error I found that a good batch size is 200; even for large datasets it keeps the geocoding from erroring out.

I am not using tqdm for two reasons: I could not get it to work in JupyterLab, and I find the row-by-row output just as effective.

Enjoy!

import math
import time
from random import randint
from time import sleep

import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

def batch_geocode(wait_time_batch, wait_time_retries, data_df, batch_size, address_column, unique_array_pos):
    unique = data_df[address_column].unique()  # get unique addresses from the dataframe
    un_size = len(unique)
    n_iter = math.ceil(un_size / batch_size)  # number of batches necessary
    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    print('size: ' + str(un_size))
    print('n_iter: ' + str(n_iter))
    final = np.empty((0, 2), dtype=object)

    for i in range(unique_array_pos, n_iter, 1):
        try:
            start_iter = time.perf_counter()
            start = i * batch_size  # first row of this batch (also correct for i == 0)
            print('batch:' + str(i) + ',row number:' + str(start))
            # fresh geocoder with a randomised user_agent for every batch
            geolocator = Nominatim(user_agent='trial' + str(randint(0, 1000)))
            # random retry delay, bounded by the wait_time_retries parameter
            geocode = RateLimiter(geolocator.geocode, max_retries=3,
                                  error_wait_seconds=randint(1 * 100, wait_time_retries * 100) / 100)
            temp1 = unique[start:(i + 1) * batch_size]
            loc = np.array([geocode(x) for x in temp1])
            temp2 = np.c_[temp1, loc]  # pair each address with its geocoded location
            final = np.append(final, temp2, axis=0)
            sleep(randint(1 * 100, wait_time_batch * 100) / 100)  # random pause between batches
            print(f'iteration time: {time.perf_counter() - start_iter: .2f}'
                  + f', total time: {time.perf_counter() - start_time:.2f}')
        except Exception as e:
            print(e)
            # i is the batch index: pass it back in as unique_array_pos to resume here
            print('failed execution, resume from batch: ' + str(i))
            return pd.DataFrame(final, columns=['address', 'location'])
    return pd.DataFrame(final, columns=['address', 'location'])
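
A minimal usage sketch (the file name and column name are illustrative assumptions): run the function over a dataframe, and if a batch fails, pass the printed batch index back in to resume:

df = pd.read_csv('addresses.csv')  # hypothetical input with a 'full_address' column
result = batch_geocode(wait_time_batch=5, wait_time_retries=2, data_df=df,
                       batch_size=200, address_column='full_address', unique_array_pos=0)
# If a batch fails, note the printed batch index and resume from there, e.g.:
# remainder = batch_geocode(5, 2, df, 200, 'full_address', unique_array_pos=<printed index>)
# result = pd.concat([result, remainder], ignore_index=True)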
