Find closest point in Pandas DataFrames

I am quite new to Python. I have the following table in Postgres. Each polygon is described by four coordinates that share the same Id and Zone name. I have loaded this data into a pandas DataFrame called df1.

Id     Order  Lat          Lon          Zone
00001  1      50.6373473   3.075029928  A
00001  2      50.63740441  3.075068636  A
00001  3      50.63744285  3.074951754  A
00001  4      50.63737839  3.074913884  A
00002  1      50.6376054   3.0750528    B
00002  2      50.6375896   3.0751209    B
00002  3      50.6374239   3.0750246    B
00002  4      50.6374404   3.0749554    B

I have JSON data with Lat and Lon values, which I have stored in a pandas DataFrame called df2.

Lat                  Lon
50.6375524099   3.07507914474
50.6375714407   3.07508201591

My task is to compare the Lat and Lon values in df2 with the four coordinates of each zone in df1, extract the matching zone name, and add it to df2.

For instance, (50.6375524099, 3.07507914474) belongs to Zone B.

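import numpy as np
import pandas as pd
import matplotlib.path as mplPath

# `engine` is an existing SQLAlchemy engine connected to the Postgres database
# Zone vertices (Id, Order, Lat, Lon, Zone)
df1 = pd.read_sql_query("""SELECT * from "zmap" """, con=engine)
# Points to classify (lat, lon)
df2 = pd.read_sql_query("""SELECT * from "E1" """, con=engine)
# zip() is lazy in Python 3, so materialise it before assigning it as a column
df2['latlon'] = list(zip(df2.lat, df2.lon))
zones = [
    ["A", [[50.637347297, 3.075029928], [50.637404408, 3.075068636], [50.637442847, 3.074951754], [50.637378390, 3.074913884]]],
]
for i in range(len(zones)):  # for each zone
    X = mplPath.Path(np.array(zones[i][1]))
    # boolean mask of the points that fall inside the current zone
    Y = X.contains_points(df2.latlon.values.tolist())
    # label the points that are in the current zone
    df2.loc[Y, 'zone'] = zones[i][0]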

Currently I have defined Zone 'A' by hand. I need to generate the zones from df1 and assign one to each coordinate in df2.

This sounds like a good use case for scipy's cdist, which computes the pairwise distances between two sets of points.

import pandas as pd
from scipy.spatial.distance import cdist


data1 = {'Lat': pd.Series([50.6373473,50.63740441,50.63744285,50.63737839,50.6376054,50.6375896,50.6374239,50.6374404]),
         'Lon': pd.Series([3.075029928,3.075068636,3.074951754,3.074913884,3.0750528,3.0751209,3.0750246,3.0749554]),
         'Zone': pd.Series(['A','A','A','A','B','B','B','B'])}

data2 = {'Lat': pd.Series([50.6375524099,50.6375714407]),
         'Lon': pd.Series([3.07507914474,3.07508201591])}


def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]

def match_value(df, col1, x, col2):
    """ Match value x from col1 row to value in col2. """
    return df[df[col1] == x][col2].values[0]


df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df1['point'] = [(x, y) for x,y in zip(df1['Lat'], df1['Lon'])]
df2['point'] = [(x, y) for x,y in zip(df2['Lat'], df2['Lon'])]

df2['closest'] = [closest_point(x, list(df1['point'])) for x in df2['point']]
df2['zone'] = [match_value(df1, 'point', x, 'Zone') for x in df2['closest']]

print(df2)
#    Lat        Lon       point                           closest                  zone
# 0  50.637552  3.075079  (50.6375524099, 3.07507914474)  (50.6375896, 3.0751209)  B
# 1  50.637571  3.075082  (50.6375714407, 3.07508201591)  (50.6375896, 3.0751209)  B
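The per-row list comprehensions above call cdist once for every point in df2. As a rough sketch of the same nearest-vertex idea, the distance matrix could instead be computed in a single call and reduced with argmin. It assumes the same sample data as above and, like the original answer, uses plain Euclidean distance on degrees, so it finds the zone of the nearest vertex rather than testing whether the point lies inside the polygon.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Same sample data as in the answer above
df1 = pd.DataFrame({
    'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839,
            50.6376054, 50.6375896, 50.6374239, 50.6374404],
    'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884,
            3.0750528, 3.0751209, 3.0750246, 3.0749554],
    'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})
df2 = pd.DataFrame({
    'Lat': [50.6375524099, 50.6375714407],
    'Lon': [3.07507914474, 3.07508201591],
})

# One (len(df2) x len(df1)) matrix of Euclidean distances in degree space
dist = cdist(df2[['Lat', 'Lon']].to_numpy(), df1[['Lat', 'Lon']].to_numpy())

# Row position in df1 of the nearest vertex for every row of df2
nearest = dist.argmin(axis=1)

# Look up the zone of that nearest vertex
df2['zone'] = df1['Zone'].to_numpy()[nearest]
print(df2)
```

For large tables the full distance matrix can get big, so this is only a sketch for data of moderate size.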

Note that the current title of the post is "Find closest point in Pandas DataFrames", but OP's attempt shows that they are actually looking for the zone within which a point is found.

It is possible to leverage the geopandas library to do this operation elegantly & efficiently.

Convert the DataFrame into a GeoDataFrame.

Then aggregate the points in df1 to create a polygon. The aggregation operation is called dissolve .

Finally, use a spatial join (sjoin) with the covered_by predicate, so that each point in df2 is matched to the polygon in zones that covers it, and output the Lat, Lon and Zone columns.

# set up
import pandas as pd
import geopandas as gpd

df1 = pd.DataFrame({
  'Id': [1, 1, 1, 1, 2, 2, 2, 2],
  'Order': [1, 2, 3, 4, 1, 2, 3, 4],
  'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839, 50.6376054, 50.6375896, 50.6374239, 50.6374404],
  'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884, 3.0750528, 3.0751209, 3.0750246, 3.0749554],
  'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})

df2 = pd.DataFrame({
  'Lat': [50.6375524099, 50.6375714407],
  'Lon': [3.07507914474, 3.07508201591] 
})

# convert to GeoDataFrame
df1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Lon, df1.Lat))
df2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Lon, df2.Lat))

# aggregate & merge
zones = df1.dissolve(by='Zone').convex_hull.rename('geometry').reset_index()
merged = df2.sjoin(zones, how='left', predicate='covered_by')

# output
output_columns = ['Lat', 'Lon', 'Zone']
merged[output_columns]

this outputs:

         Lat       Lon Zone
0  50.637552  3.075079    B
1  50.637571  3.075082    B
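If geopandas is not available, the same containment test can be sketched with matplotlib's Path, which is what the question's own attempt uses. The version below is only an illustration under a few assumptions: the polygons are built from df1 by grouping on Zone, the Order column is taken to give the vertex order around each polygon, and a point that falls in no zone keeps None.

```python
import pandas as pd
from matplotlib.path import Path

# Same sample data as in the answer above
df1 = pd.DataFrame({
    'Order': [1, 2, 3, 4, 1, 2, 3, 4],
    'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839,
            50.6376054, 50.6375896, 50.6374239, 50.6374404],
    'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884,
            3.0750528, 3.0751209, 3.0750246, 3.0749554],
    'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})
df2 = pd.DataFrame({
    'Lat': [50.6375524099, 50.6375714407],
    'Lon': [3.07507914474, 3.07508201591],
})

points = df2[['Lat', 'Lon']].to_numpy()
df2['zone'] = None  # stays None for points outside every zone

# Assumption: 'Order' lists the vertices in order around each polygon
for zone_name, vertices in df1.sort_values('Order').groupby('Zone'):
    polygon = Path(vertices[['Lat', 'Lon']].to_numpy())
    # Boolean mask of the df2 points that lie inside this zone
    inside = polygon.contains_points(points)
    df2.loc[inside, 'zone'] = zone_name

print(df2)
```

Unlike the convex hull used above, this keeps the polygon exactly as the vertices describe it, so it depends on the vertex order being correct.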

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 