I have a .csv file with a few million rows. Each row contains data about a location (a street name), and I'm trying to convert that street name to longitude and latitude for some geographical analysis. The problem is that it takes about 0.5 s to process a single row.
I tried to improve my code by memoizing already-calculated values in a dictionary so I can reuse them, but this only brought the time per row down to about 0.3 s.
Street names do repeat quite a bit: by around the 1000th row, about 50% of lookups hit the dictionary. Unfortunately, this is still far too slow. If anyone has an idea of how to improve this code I'm all ears, as the current version is more or less useless: it would take about a week to iterate through all the rows.
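For reference, the dictionary caching I describe is essentially memoization, which can be sketched with functools.lru_cache. The geocode stand-in and its dummy coordinates below are illustrative only, not the real geopy call:

```python
from functools import lru_cache

calls = []

# Hypothetical stand-in for the slow geocoding call; lru_cache provides the
# same "store already calculated values" behaviour as the manual dictionary.
@lru_cache(maxsize=None)
def geocode_cached(street):
    calls.append(street)
    return (-73.99, 40.75)  # dummy coordinates

for street in ["Broadway", "Broadway", "Wall Street", "Broadway"]:
    geocode_cached(street)

print(len(calls))  # the slow function ran only twice for four rows
```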
My current code looks like this:
import pandas as pd
from geopy.geocoders import Nominatim

def long_lat_to_csv(dataset, output_file):
    dataset['longitude'] = None
    dataset['latitude'] = None
    geolocator = Nominatim(user_agent="test")
    longlat = dict()
    area = "New York City, USA"
    for i, row in dataset.iterrows():
        if int(i) % 100 == 0:
            print('current row:', i)
        address = row['Street Name']
        try:
            # if address was already looked up, reuse the stored coordinates
            coords = longlat[address]
        except KeyError:
            # address isn't calculated yet
            loc = geolocator.geocode(address + ',' + area)
            # also store addresses that return None, so they aren't retried
            coords = None if loc is None else (loc.longitude, loc.latitude)
            longlat[address] = coords
        if coords is not None:
            dataset.at[i, 'longitude'] = coords[0]
            dataset.at[i, 'latitude'] = coords[1]
    dataset.to_csv(output_file, index=False)
Iterating over a DataFrame with iterrows is a convenient method but has very poor performance. Here, you should geocode each distinct street name only once, then use merge to copy those coordinates back to the original DataFrame. It could become:
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="test")
area = "New York City, USA"
addresses = dataset['Street Name'].drop_duplicates()
addresses = pd.concat([addresses,
                       pd.DataFrame([[loc.longitude, loc.latitude] if loc is not None else [None, None]
                                     for address in addresses
                                     for loc in [geolocator.geocode(address + ',' + area)]],
                                    index=addresses.index,
                                    columns=['longitude', 'latitude'])],
                      axis=1)
dataset = dataset.merge(addresses, on=['Street Name'])
dataset.to_csv(output_file, index=False)
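The drop_duplicates/merge mechanics above can be tested without any network calls by swapping the geocoder for a stub. Everything in this sketch (fake_geocode, its coordinate table, and the sample streets) is hypothetical; only the deduplicate-then-merge pattern is the point:

```python
import pandas as pd

# Hypothetical stand-in for geolocator.geocode: a local lookup instead of a
# network call. Like geopy, it returns coordinates or None.
FAKE_COORDS = {"Broadway": (-73.9877, 40.7505),
               "Wall Street": (-74.0088, 40.7060)}
calls = []

def fake_geocode(street):
    calls.append(street)
    return FAKE_COORDS.get(street)

dataset = pd.DataFrame(
    {"Street Name": ["Broadway", "Wall Street", "Broadway", "Unknown Rd"]})

# Geocode each distinct street name once...
unique = dataset["Street Name"].drop_duplicates()
rows = []
for street in unique:
    loc = fake_geocode(street)
    rows.append(list(loc) if loc is not None else [None, None])
coords = pd.DataFrame(rows, index=unique.index,
                      columns=["longitude", "latitude"])
lookup = pd.concat([unique, coords], axis=1)

# ...then merge copies the coordinates back to every matching row.
result = dataset.merge(lookup, on="Street Name")

print(len(result), len(calls))  # 4 rows filled with only 3 geocode lookups
```

With real data the saving scales with how often street names repeat: a few million rows with heavy repetition collapse to one slow geocode call per distinct name.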