
Python: performance suggestions for iterating through a Pandas DataFrame and adding new values calculated with geopy.geocoders Nominatim

I have a .csv file with a few million rows. Each row contains location data (a street name), and I'm trying to convert that street name to longitude and latitude for some geographical analysis. The issue is that it takes about 0.5 s to process a single row.

I tried improving my code with memoization, saving already calculated values in a dictionary so I can reuse them, but this only brought the time per row down to about 0.3 s.

Street names repeat quite a bit: by around the 1000th row, about 50% of lookups are dictionary hits. Unfortunately, this still seems way too slow. If anyone has an idea of how to improve this code, I'm all ears, as the current version is more or less useless: it would take a week to iterate through all the rows.

My current code looks like this:

from geopy.geocoders import Nominatim

def long_lat_to_csv(dataset, output_file):
    dataset['longitude'] = None
    dataset['latitude'] = None
    geolocator = Nominatim(user_agent="test")
    longlat = dict()  # cache: street name -> [longitude, latitude], or None
    area = "New York City, USA"
    for i, row in dataset.iterrows():
        if int(i) % 100 == 0:
            print('current row:', i)
        address = row['Street Name']
        if address not in longlat:
            # address isn't calculated yet
            try:
                loc = geolocator.geocode(address + ',' + area)
                longlat[address] = [loc.longitude, loc.latitude]
            except Exception:
                # also cache addresses that fail to geocode,
                # so they aren't retried on every occurrence
                longlat[address] = None
        if longlat[address] is not None:
            dataset.at[i, 'longitude'] = longlat[address][0]
            dataset.at[i, 'latitude'] = longlat[address][1]

    dataset.to_csv(output_file, index=False)

Iterating over a DataFrame with iterrows is convenient, but it has very poor performance.
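To get a feel for the gap, here is a rough micro-benchmark on a synthetic frame (a sketch only: the df name and the 100,000-row size are arbitrary, and absolute timings depend on your machine):

import numpy as np
import pandas as pd
import timeit

df = pd.DataFrame({'x': np.arange(100_000)})

def with_iterrows():
    # row-by-row access: each row is materialized as a Series
    return sum(row['x'] for _, row in df.iterrows())

def vectorized():
    # whole-column operation
    return df['x'].sum()

print(timeit.timeit(with_iterrows, number=1))  # typically seconds
print(timeit.timeit(vectorized, number=1))     # typically sub-millisecond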

Here, you should:

  1. extract unique addresses from the original DataFrame
  2. compute the geographical coordinates of those unique addresses (temporarily forget about pandas here)
  3. use merge to copy those coordinates back to the original DataFrame

It could become:

from geopy.geocoders import Nominatim
import pandas as pd

geolocator = Nominatim(user_agent="test")
area = "New York City, USA"

# geocode each unique street name exactly once
addresses = dataset['Street Name'].drop_duplicates()
coords = pd.DataFrame(
    [[loc.longitude, loc.latitude] if loc is not None else [None, None]
     for address in addresses
     for loc in [geolocator.geocode(address + ',' + area)]],
    index=addresses.index, columns=['longitude', 'latitude'])
addresses = pd.concat([addresses, coords], axis=1)

# merge copies the coordinates back onto every matching row
dataset = dataset.merge(addresses, on=['Street Name'])
dataset.to_csv(output_file, index=False)
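As a side note, the dominant cost per unique address is the network round-trip itself: the public Nominatim service's usage policy allows roughly one request per second. geopy provides a RateLimiter wrapper that throttles calls and retries transient errors; a minimal sketch (the parameter values here are illustrative):

from geopy.extra.rate_limiter import RateLimiter

# throttle to ~1 request per second and retry transient failures
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1, max_retries=2)

unique_streets = dataset['Street Name'].drop_duplicates()
locations = [geocode(street + ',' + area) for street in unique_streets]

For millions of rows with many unique street names, running a local Nominatim instance removes that rate limit entirely.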
