Find closest point in Pandas DataFrames
I am quite new to Python. I have the following table in Postgres. Each polygon is described by four coordinates that share the same Id and Zone name. I have stored this data in a pandas DataFrame called df1.
Id Order Lat Lon Zone
00001 1 50.6373473 3.075029928 A
00001 2 50.63740441 3.075068636 A
00001 3 50.63744285 3.074951754 A
00001 4 50.63737839 3.074913884 A
00002 1 50.6376054 3.0750528 B
00002 2 50.6375896 3.0751209 B
00002 3 50.6374239 3.0750246 B
00002 4 50.6374404 3.0749554 B
I have JSON data with Lon and Lat values, and I have stored them in a pandas DataFrame called df2.
Lat Lon
50.6375524099 3.07507914474
50.6375714407 3.07508201591
My task is to compare the Lat and Lon values in df2 with the four coordinates of each zone in df1, extract the zone name, and add it to df2.
For instance, (50.637552409 3.07507914474) belongs to Zone B.
import numpy as np
import pandas as pd
import matplotlib.path as mplPath

# engine is assumed to be a database connection created elsewhere
# This is ID with Zone
df1 = pd.read_sql_query("""SELECT * from "zmap" """, con=engine)
# This is with lat,lon values
df2 = pd.read_sql_query("""SELECT * from "E1" """, con=engine)
df2['latlon'] = list(zip(df2.lat, df2.lon))
zones = [
    ["A", [[50.637347297, 3.075029928], [50.637404408, 3.075068636],
           [50.637442847, 3.074951754], [50.637378390, 3.074913884]]]]
for name, vertices in zones:  # for each zone's points
    X = mplPath.Path(np.array(vertices))
    # find which points are inside the current zone
    Y = X.contains_points(df2.latlon.values.tolist())
    # label points that are in the current zone
    df2.loc[Y, 'zone'] = name
Currently I have done it manually for Zone 'A'. I need to generate the "Zones" for the coordinates in df2.
This sounds like a good use case for scipy cdist, also discussed here.
import pandas as pd
from scipy.spatial.distance import cdist
data1 = {'Lat': pd.Series([50.6373473,50.63740441,50.63744285,50.63737839,50.6376054,50.6375896,50.6374239,50.6374404]),
         'Lon': pd.Series([3.075029928,3.075068636,3.074951754,3.074913884,3.0750528,3.0751209,3.0750246,3.0749554]),
         'Zone': pd.Series(['A','A','A','A','B','B','B','B'])}
data2 = {'Lat': pd.Series([50.6375524099,50.6375714407]),
         'Lon': pd.Series([3.07507914474,3.07508201591])}
def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]
def match_value(df, col1, x, col2):
    """ Match value x from col1 row to value in col2. """
    return df[df[col1] == x][col2].values[0]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1['point'] = [(x, y) for x,y in zip(df1['Lat'], df1['Lon'])]
df2['point'] = [(x, y) for x,y in zip(df2['Lat'], df2['Lon'])]
df2['closest'] = [closest_point(x, list(df1['point'])) for x in df2['point']]
df2['zone'] = [match_value(df1, 'point', x, 'Zone') for x in df2['closest']]
print(df2)
#          Lat       Lon                           point                  closest zone
# 0  50.637552  3.075079  (50.6375524099, 3.07507914474)  (50.6375896, 3.0751209)    B
# 1  50.637571  3.075082  (50.6375714407, 3.07508201591)  (50.6375896, 3.0751209)    B
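As a possible variation on the answer above, the per-row list comprehensions could be replaced by a single cdist call that builds the full distance matrix and takes the row-wise argmin. This is only a sketch under the same assumptions as the answer: distances are plain Euclidean distances on raw degrees, and the zone of the nearest vertex is taken as the point's zone, which is not guaranteed to be the zone that actually contains the point. The variable names dist and nearest are illustrative.

```python
import pandas as pd
from scipy.spatial.distance import cdist

df1 = pd.DataFrame({
    "Lat": [50.6373473, 50.63740441, 50.63744285, 50.63737839,
            50.6376054, 50.6375896, 50.6374239, 50.6374404],
    "Lon": [3.075029928, 3.075068636, 3.074951754, 3.074913884,
            3.0750528, 3.0751209, 3.0750246, 3.0749554],
    "Zone": ["A", "A", "A", "A", "B", "B", "B", "B"],
})
df2 = pd.DataFrame({
    "Lat": [50.6375524099, 50.6375714407],
    "Lon": [3.07507914474, 3.07508201591],
})

# Distance matrix of shape (len(df2), len(df1)):
# one row per query point, one column per zone vertex.
dist = cdist(df2[["Lat", "Lon"]].to_numpy(), df1[["Lat", "Lon"]].to_numpy())

# Positional index of the nearest df1 vertex for each df2 point.
nearest = dist.argmin(axis=1)

# Look up the zone of that nearest vertex by position.
df2["zone"] = df1["Zone"].to_numpy()[nearest]
print(df2)
```

Looking zones up by position also avoids comparing float tuples for equality, which the match_value helper relies on.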
Note that the current title of the post is "Find closest point in Pandas DataFrames", but OP's attempt shows that they are looking for the zone within which a point is found.
It is possible to leverage the geopandas library to do this operation elegantly and efficiently.
Convert each DataFrame into a GeoDataFrame.
Then aggregate the points in df1 to create a polygon for each zone. The aggregation operation is called dissolve.
Finally, use a spatial join (sjoin) with a predicate such that points in df2 are covered by the polygon representing a Zone in zones, and output the Lat, Lon and Zone columns.
# set up
import pandas as pd
import geopandas as gpd
df1 = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2, 2, 2],
    'Order': [1, 2, 3, 4, 1, 2, 3, 4],
    'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839, 50.6376054, 50.6375896, 50.6374239, 50.6374404],
    'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884, 3.0750528, 3.0751209, 3.0750246, 3.0749554],
    'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})
df2 = pd.DataFrame({
    'Lat': [50.6375524099, 50.6375714407],
    'Lon': [3.07507914474, 3.07508201591]
})
# convert to GeoDataFrame
df1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Lon, df1.Lat))
df2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Lon, df2.Lat))
# aggregate & merge
zones = df1.dissolve(by='Zone').convex_hull.rename('geometry').reset_index()
merged = df2.sjoin(zones, how='left', predicate='covered_by')
# output
output_columns = ['Lat', 'Lon', 'Zone']
merged[output_columns]
This outputs:
Lat Lon Zone
0 50.637552 3.075079 B
1 50.637571 3.075082 B
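If geopandas is not available, the same point-in-polygon idea could probably be sketched by generalizing the question's matplotlib.path attempt to every zone in df1. The sketch below assumes that the Order column gives the vertex order around each polygon and that zones do not overlap (a later zone would otherwise overwrite an earlier label). The names query_points, group and polygon are illustrative.

```python
import pandas as pd
from matplotlib.path import Path

df1 = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2, 2],
    "Order": [1, 2, 3, 4, 1, 2, 3, 4],
    "Lat": [50.6373473, 50.63740441, 50.63744285, 50.63737839,
            50.6376054, 50.6375896, 50.6374239, 50.6374404],
    "Lon": [3.075029928, 3.075068636, 3.074951754, 3.074913884,
            3.0750528, 3.0751209, 3.0750246, 3.0749554],
    "Zone": ["A", "A", "A", "A", "B", "B", "B", "B"],
})
df2 = pd.DataFrame({
    "Lat": [50.6375524099, 50.6375714407],
    "Lon": [3.07507914474, 3.07508201591],
})

# Query points as an (n, 2) array of (Lat, Lon) pairs.
query_points = df2[["Lat", "Lon"]].to_numpy()

# Points that fall in no zone keep None.
df2["Zone"] = None

# Build one polygon per zone, with vertices in the order given by 'Order'.
for zone, group in df1.sort_values("Order").groupby("Zone"):
    polygon = Path(group[["Lat", "Lon"]].to_numpy())
    inside = polygon.contains_points(query_points)  # boolean mask over df2 rows
    df2.loc[inside, "Zone"] = zone

print(df2)
```

Unlike the convex hull used in the geopandas answer, this tests against the polygon exactly as its vertices are ordered, so it only behaves as expected when that order traces the outline without self-intersection.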