Group dataframe by distance to datapoints in different dataframe using long and lat
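I have two dataframes. One contains several power plants with their longitude and latitude, each in its own column. The other contains several substations, also with long and lat. What I want to do is assign each power plant to its closest substation.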
df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881, 8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})
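I think I need to calculate the distance between all points and then group the dataframe, but I am not sure how. I found the numpy.linalg.norm() function, but it did not work for me. Any help is appreciated.
I found this solution, which is basically exactly what I need: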
import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: power plants
    a = row.y, row.x  # geopy expects (latitude, longitude); y is lat, x is long
    distances = []
    for j, row2 in df2.iterrows():  # B: substations
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    # position of the closest substation in df2 and its distance
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    print("A", i, "is closest to B", min_index, min_distance, "km")
It works, but it takes forever, and my dataset is very large. I think an approach using .apply might be faster. Does anyone know how to rewrite this using apply?
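As a rough sketch of a faster route than the nested iterrows loops (and than .apply, which still runs row by row in Python), the whole distance matrix can be computed at once with NumPy broadcasting. Note that this swaps geopy's geodesic distance for the haversine (spherical) formula, so results can differ slightly from the loop version. The helper name haversine_matrix, the Earth radius constant and the output column names are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)

def haversine_matrix(lat1, lon1, lat2, lon2):
    """Great-circle distances in km between every point in set 1 and set 2."""
    # Column vector for set 1, row vector for set 2, so broadcasting
    # yields a (len(set 1), len(set 2)) matrix.
    lat1 = np.radians(lat1)[:, None]
    lon1 = np.radians(lon1)[:, None]
    lat2 = np.radians(lat2)[None, :]
    lon2 = np.radians(lon2)[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# y holds latitude, x holds longitude
dist = haversine_matrix(df1['y'].to_numpy(), df1['x'].to_numpy(),
                        df2['y'].to_numpy(), df2['x'].to_numpy())
nearest = dist.argmin(axis=1)  # position of the closest substation per plant

result = df1.assign(
    closest_ss=df2['ID_ss'].to_numpy()[nearest],
    distance_km=dist[np.arange(len(df1)), nearest],
)
print(result)
```

The matrix has one entry per plant/substation pair, so for very large inputs it may need to be built in chunks of plants to keep memory in check.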
Here is a solution using geopandas. I don't know how well it performs on larger datasets.
import geopandas as gpd
import pandas as pd
df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881, 8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})
# create GeoDataFrames from the original dfs
gdf1 = gpd.GeoDataFrame(df1[['ID_pp']], geometry=gpd.points_from_xy(df1['x'], df1['y']), crs='EPSG:4326')
gdf2 = gpd.GeoDataFrame(df2[['ID_ss']], geometry=gpd.points_from_xy(df2['x'], df2['y']), crs='EPSG:4326')
# convert to another coordinate reference system for units in metres, EPSG:5243 suits Germany as far as I know
gdf1 = gdf1.to_crs('EPSG:5243')
gdf2 = gdf2.to_crs('EPSG:5243')
gdf2 = gdf2.set_index('ID_ss')
def get_closest_ss(point, other):
    # distance from this point to every geometry in `other`, indexed by ID_ss
    s = other.distance(point)
    return (s.idxmin(), s.min())
# find ID of closest substation to all power plants
gdf1[['closest_ss', 'distance']] = gdf1.geometry.apply(get_closest_ss, args=(gdf2,)).to_list()
# merge the dataframe with the power plants (gdf1) with the closest substation (gdf2)
gdf = gdf1.merge(gdf2, left_on='closest_ss', right_index=True, suffixes=('', '_ss'))
print(gdf)
# output
  ID_pp                         geometry closest_ss       distance  \
0    p1   POINT (159807.847 -320153.333)         s4  717896.945731
1    p2    POINT (79356.344 -330713.037)         s4  711534.096071
2    p3   POINT (159807.847 -320153.333)         s4  717896.945731
3    p4  POINT (-171106.060 -202478.708)         s4  592470.679838

                     geometry_ss
0  POINT (-28563.516 372589.227)
1  POINT (-28563.516 372589.227)
2  POINT (-28563.516 372589.227)
3  POINT (-28563.516 372589.227)
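On the open question of larger datasets: the apply above still computes the distance from every power plant to every substation. One possible alternative, sketched here under my own assumptions rather than taken from the answer, is a k-d tree from SciPy (scipy.spatial.cKDTree) built on unit-sphere coordinates. The nearest point by straight-line (chord) distance is also the nearest by great-circle distance, and the chord converts back to an arc length. The helper to_unit_xyz, the Earth radius constant and the output column names are hypothetical, and distances are spherical approximations in km rather than projected metres.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)

def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D points on the unit sphere."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# Build the tree once on the substations (y is latitude, x is longitude)
tree = cKDTree(to_unit_xyz(df2['y'].to_numpy(), df2['x'].to_numpy()))

# Query the nearest substation for all power plants in one call
chord, idx = tree.query(to_unit_xyz(df1['y'].to_numpy(), df1['x'].to_numpy()), k=1)

# A chord of length c on the unit sphere spans an angle of 2 * arcsin(c / 2)
arc_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.clip(chord / 2, 0, 1))

df1_nearest = df1.assign(closest_ss=df2['ID_ss'].to_numpy()[idx],
                         distance_km=arc_km)
print(df1_nearest)
```

The tree is built once and each lookup avoids scanning every substation, which is where the saving over a full pairwise comparison would come from when both tables are large.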