
Find closest (lat/lon) observation from one pandas dataframe for each observation in a second dataframe

Problem summary:

I have two dataframes. The first (df1) is small (nearly always under 100 observations, typically under 50) and holds a set of point identifiers with their lat/lon coordinates. The second (df2) is very large (hundreds of thousands of observations) and also has lat/lon coordinates. I wish to create two new columns in df2: one with the identifier of the nearest point from df1, the other with the distance to that point. My current method is quite clunky and I think it can be optimized significantly. For additional context, there is a single df1 (small dataframe), but I will repeat this process for multiple df2s (large dataframes).

Setup/Sample Data:

# imports:
import pandas as pd
import geopy.distance
from faker import Faker

# creating sample data:
Faker.seed(0)
fake=Faker()

id1=[]
lat1=[]
lon1=[]
id2=[]
lat2=[]
lon2=[]
length1=10 # length of df1
length2=100 # length of df2

for x in range(length1):
    a=fake.local_latlng()
    id1.append(x)
    lat1.append(float(a[0]))
    lon1.append(float(a[1]))
for x in range(length2):
    a=fake.local_latlng()
    id2.append(x)
    lat2.append(float(a[0]))
    lon2.append(float(a[1]))

dict1={
    'loc_id' : id1,
    'lat' : lat1,
    'lon' : lon1,
    }

dict2={
    'point_id' : id2,
    'lat' : lat2,
    'lon' : lon2,
    }

df1=pd.DataFrame(dict1)
df2=pd.DataFrame(dict2)

Current Solution:

# calculating distances:
for x in range(len(df1)):
    loc_id=df1.iloc[x]['loc_id']
    pt1=(df1.iloc[x]['lat'],df1.iloc[x]['lon'])
    for y in range(len(df2)):
        pt2=(df2.iloc[y]['lat'],df2.iloc[y]['lon'])
        dist=geopy.distance.distance(pt1,pt2).miles
        df2.loc[y,x]=dist

# determining minimum distance and label:
temp_cols=list(range(len(df1)))
df2['min_dist']=df2[temp_cols].min(axis=1)
df2['min_loc']=df2[temp_cols].idxmin(axis=1)

# removing extra columns:
df2=df2.drop(temp_cols,axis=1)
print(df2.head())

Possible solutions:

This code is quite obviously slow, since I compute the distance for each pair of points. Conceptually, I think this can be improved, but I'm having trouble implementing the improvements. Some ideas:

  1. Vectorize operations. This accepted answer indicates that operations over vectors are much faster, but I don't know how to apply the geopy.distance.distance() function over a vector (or whether that's possible).
  2. Eliminate "dominated" points: if one candidate point is farther than another in both lat and lon from the point being matched, it can never be the nearest, so it could be dropped from the set I have to check against. I imagine this adds work/processing on the front end, but pays off by reducing the number of points checked per observation. Figuring out that algorithm is not obvious to me, though.
  3. Bin points into groups that are near one another, giving smaller candidate sets to compare. Maybe it's even possible to identify the closest point(s) before calculating any distances. The danger is that some of the points in df1 might be quite close together as well.

Additional details: The odds of two points having identical distances is small, and I am happy randomly selecting any point that ties for closest if that should come up.
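To illustrate idea 1: geopy.distance.distance (geodesic) has no vectorized form, but the haversine great-circle formula, which is typically within about 0.5% of the geodesic result, can be evaluated over whole NumPy arrays at once via broadcasting. A sketch, using tiny stand-in dataframes in place of the real df1/df2 (the radius value 3958.8 is an assumed mean Earth radius in miles):

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; broadcasts over NumPy arrays."""
    r = 3958.8  # mean Earth radius in miles (assumption)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

# Toy stand-ins for df1 (small) and df2 (large)
df1 = pd.DataFrame({'loc_id': [10, 20], 'lat': [0.0, 40.0], 'lon': [0.0, -75.0]})
df2 = pd.DataFrame({'point_id': [0, 1], 'lat': [1.0, 41.0], 'lon': [1.0, -74.0]})

# Full distance matrix via broadcasting: rows = df2 points, cols = df1 points
d = haversine_miles(df2['lat'].values[:, None], df2['lon'].values[:, None],
                    df1['lat'].values[None, :], df1['lon'].values[None, :])
df2['min_dist'] = d.min(axis=1)
df2['min_loc'] = df1['loc_id'].values[d.argmin(axis=1)]
```

Since df1 is small (under 100 rows), the full len(df2) x len(df1) matrix stays manageable even for hundreds of thousands of rows in df2, and all the arithmetic runs in compiled NumPy code rather than a Python loop.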

Based upon k-nearest neighbors using BallTree

Approach

  1. Create a k-nearest-neighbor tree from the lat/lon points in df1. Use BallTree, since it supports custom distance functions such as geopy.distance.distance.
  2. For each lat/lon in df2, query the tree from step 1 for its closest point.
  3. Note: BallTree would be even faster if we used a builtin distance function such as haversine.

Code

import pandas as pd
import numpy as np
from faker import Faker
from sklearn.neighbors import BallTree
from geopy import distance

import functools
import time

# Timing Decorator
def timer(func):
    """Print the runtime of the decorated function"""
    @functools.wraps(func)
    def wrapper_timer(*args, **kwargs):
        start_time = time.perf_counter()    # 1
        value = func(*args, **kwargs)
        end_time = time.perf_counter()      # 2
        run_time = end_time - start_time    # 3
        print(f"Finished {func.__name__!r} in {run_time:.4f} secs")
        return value
    return wrapper_timer

def generate_data(length):
    '''
        Generate data frame with random lat/lon data.
        Data of the same length is always identical since we use the same initial seed.
    '''
    Faker.seed(0)
    fake = Faker()

    id = list(range(length))
    # First two fields of fake.local_latlng are lat & lon as string
    # Generate vector of fake.local_latlng then unpack out lat/lon array
    lat, lon = list(zip(*[fake.local_latlng() for _ in range(length)]))[:2]
    
    # Convert strings to float
    lat = [float(x) for x in lat]
    lon = [float(x) for x in lon]
    
    return pd.DataFrame({'point_id':id, 'lat':lat, 'lon':lon})

def generate_balltree(df):
    '''
        Generate BallTree using a customized distance metric (i.e. geodesic distance)
    '''
    return  BallTree(df[['lat', 'lon']].values, metric=lambda u, v: distance.distance(u, v).miles)

@timer
def find_matches(tree, df):
    '''
        Find closest matches in df to items in tree
    '''
    distances, indices = tree.query(df[['lat', 'lon']].values, k = 1)
    df['min_dist'] = distances[:, 0]   # query returns shape (n, 1); take the single column
    df['min_loc'] = indices[:, 0]      # row index into df1
    
    return df

@timer
def find_min_op(df1, df2):
    ' OP solution (to compare timing) '

    for x in range(len(df1)):
        #loc_id=df1.iloc[x]['loc_id'] # not used
        pt1=(df1.iloc[x]['lat'],df1.iloc[x]['lon'])
        for y in range(len(df2)):
            pt2=(df2.iloc[y]['lat'],df2.iloc[y]['lon'])
            dist=distance.distance(pt1,pt2).miles
            df2.loc[y,x]=dist

    # determining minimum distance and label:
    temp_cols=list(range(len(df1)))
    df2['min_dist']=df2[temp_cols].min(axis=1)
    df2['min_loc']=df2[temp_cols].idxmin(axis=1)

    # removing extra columns:
    df2 = df2.drop(columns = temp_cols)
    
    return df2

Tests

df1 = 100 elements, df2 = 1000 elements

l1, l2 = 100, 1000
df1 = generate_data(l1)
df2 = generate_data(l2)
tree = generate_balltree(df1)
find_matches(tree, df2)

df2 = generate_data(l2)  # Regenerate df2 for next test since find_matches modified it
find_min_op(df1, df2)

Output

Finished 'find_matches' in 32.1677 secs
Finished 'find_min_op' in 147.7042 secs

Thus, this approach is ~5x faster in this test.
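As note 3 in the Approach suggests, the Python-level lambda metric dominates the runtime above; swapping it for BallTree's builtin 'haversine' metric moves the distance computation entirely into compiled code. A sketch of that variant (the builtin metric expects [lat, lon] in radians and returns distances in radians, so we convert; 3958.8 is an assumed mean Earth radius in miles, and the toy dataframes are stand-ins for the real data):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumption)

def generate_haversine_tree(df):
    # Builtin 'haversine' metric requires [lat, lon] in radians
    return BallTree(np.radians(df[['lat', 'lon']].values), metric='haversine')

def find_matches_haversine(tree, df):
    dist, idx = tree.query(np.radians(df[['lat', 'lon']].values), k=1)
    df['min_dist'] = dist[:, 0] * EARTH_RADIUS_MILES  # radians -> miles
    df['min_loc'] = idx[:, 0]                         # row index into df1
    return df

# Toy stand-ins for df1 and df2
df1 = pd.DataFrame({'lat': [0.0, 40.0], 'lon': [0.0, -75.0]})
df2 = pd.DataFrame({'lat': [1.0, 41.0], 'lon': [1.0, -74.0]})
find_matches_haversine(generate_haversine_tree(df1), df2)
```

The trade-off is accuracy: haversine treats the Earth as a sphere, while geopy's default geodesic distance uses an ellipsoid, so results differ by a small fraction of a percent, which is usually acceptable for nearest-neighbor matching.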

