Getting the distance between two geopandas data frames geometry points

I am working with spatial data for the first time. I have to compare two dataframes that have lat and long details. I have converted both to GeoPandas dataframes.

import pandas as pd
from pandas import DataFrame
import geopandas as gpd
from neighbors import nearest_neighbor


df = pd.DataFrame([[1973,22.525158,88.330775],[1976,72.85136,19.10840],[898,91.78523,26.15012]],columns=['id', 'lat', 'long'])
gdf1 = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long,df.lat))

df2 = pd.DataFrame([['06c979eaa59f',29.873870,76.965620],['19aedbb2e743',20.087574,76.180045],['5060a3931a43',31.289770,75.572340]],columns=['id','lat','lon']) 
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.lon,df2.lat))

My DF1 has 1 million rows and DF2 has around 7000 rows. I am trying to get the nearest neighbor from DF2 for each record in DF1.

I have tried two methods. Both run very fast and the results are workable. However, they are not accurate.

Method 1:

Please check this link

On that page, I used the nearest neighbors method from sklearn.neighbors. This returns the results in meters. However, when I manually check the distance between the lat/long pairs from the two data frames, I always find that the nearest-neighbor method returns about 1/4 of the distance.

For example, if the distance returned by the above method is 125 meters, both Google Maps and https://www.geodatasource.com/distance-calculator return a distance of around 500 meters. The manually checked distance keeps fluctuating around 4 times the returned result.

Method 2:

In the second method I followed the code given on gis.stackexchange.com.

https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe

import itertools
from operator import itemgetter

import geopandas as gpd
import numpy as np
import pandas as pd

from scipy.spatial import cKDTree
from shapely.geometry import Point, LineString

df = pd.DataFrame([[1973,22.525158,88.330775],[1976,72.85136,19.10840],[898,91.78523,26.15012]],columns=['id', 'lat', 'long'])
gdf1 = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long,df.lat))

df2 = pd.DataFrame([['06c979eaa59f',29.873870,76.965620],['19aedbb2e743',20.087574,76.180045],['5060a3931a43',31.289770,75.572340]],columns=['id','lat','lon']) 
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.lon,df2.lat))

In this, I replaced gpd1 and gpd2 with my own data frames.

def ckdnearest(gdfA, gdfB, gdfB_cols=['id']):
    # resetting the index of gdfA and gdfB here.
    gdfA = gdfA.reset_index(drop=True)
    gdfB = gdfB.reset_index(drop=True)
    A = np.concatenate(
        [np.array(geom.coords) for geom in gdfA.geometry.to_list()])
    B = [np.array(geom.coords) for geom in gdfB.geometry.to_list()]
    B_ix = tuple(itertools.chain.from_iterable(
        [itertools.repeat(i, x) for i, x in enumerate(list(map(len, B)))]))
    B = np.concatenate(B)
    ckd_tree = cKDTree(B)
    dist, idx = ckd_tree.query(A, k=1)
    idx = itemgetter(*idx)(B_ix)
    gdf = pd.concat(
        [gdfA, gdfB.loc[idx, gdfB_cols].reset_index(drop=True),
         pd.Series(dist, name='dist')], axis=1)
    return gdf

c = ckdnearest(gdf1, gdf2)

The above runs very fast and returns the result. However, the returned distance values are at least 100 times lower than the values I get when I check manually.

multiplier: 107.655914

[Image: Excel screenshot]

In the above Excel screenshot, the first column shows the results returned by Python, while the second column shows the results returned by the same website given above. While these approximate results get me started, I want accurate results. How do I compare the two data frames given above and get the most accurate nearest distance for each row in DF1?

When working with spatial data, you should be aware that latitude/longitude coordinates are angles on a sphere (more precisely, an ellipsoid), not positions on a plane. In EPSG:4326 (WGS 84), a distance computed directly from lat/lon values is in degrees, not meters. The conversion to meters also depends on the latitude of the points: 1 degree of longitude covers about 111 km at the equator but fewer meters at high latitudes, while 1 degree of latitude stays close to 111 km everywhere.
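To make the latitude dependence concrete, here is a minimal sketch using the haversine formula on a spherical Earth. The function name `haversine_m` and the radius constant are assumptions chosen for illustration; a sphere is only an approximation of the real ellipsoid.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude shrinks with latitude; one degree of latitude does not.
one_deg_lon_equator = haversine_m(0.0, 0.0, 0.0, 1.0)   # about 111.2 km
one_deg_lon_60n = haversine_m(60.0, 0.0, 60.0, 1.0)     # about 55.6 km
one_deg_lat = haversine_m(0.0, 0.0, 1.0, 0.0)           # about 111.2 km

print(one_deg_lon_equator, one_deg_lon_60n, one_deg_lat)
```

This is why a Euclidean distance taken on raw degree values, as a KD-tree built on lon/lat pairs does, cannot be turned into meters with a single constant factor.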

You can check this discussion for possible solutions to this problem: https://gis.stackexchange.com/questions/293310/how-to-use-geoseries-distance-to-get-the-right-answer

To give you an example, one possibility is to convert your GeoDataFrame to the UTM projection that covers your region. For example, Belgium intersects with UTM zone 31N, EPSG:32631. Plain latitude/longitude coordinates (WGS 84) have the code EPSG:4326; this is a geographic coordinate system rather than the Mercator projection. To convert a GeoDataFrame/GeoSeries, you first need to provide the CRS when creating it:

s = gpd.GeoSeries(points, crs=4326)

where points is a list of shapely.geometry.Point objects

and then convert to a given UTM zone:

s_utm = s.to_crs(epsg=32631)

Now the distance you compute between points in s_utm will be in meters.
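Putting the two steps together, a small end-to-end sketch might look like this. The two sample points (roughly Brussels and Antwerp) are assumptions picked only because they fall inside UTM zone 31N; they are not from the question's data.

```python
import geopandas as gpd
from shapely.geometry import Point

# Illustrative points in Belgium, given as (lon, lat). These are assumptions.
points = [Point(4.3517, 50.8503), Point(4.4025, 51.2194)]

# Declare the input as WGS 84 lon/lat, then reproject to UTM zone 31N (meters).
s = gpd.GeoSeries(points, crs="EPSG:4326")
s_utm = s.to_crs(epsg=32631)

# shapely computes a planar distance, which is now expressed in meters.
dist_m = s_utm.iloc[0].distance(s_utm.iloc[1])
print(f"{dist_m / 1000:.1f} km")
```

For data in another region, recent geopandas versions offer `GeoSeries.estimate_utm_crs()`, which may help to pick a suitable zone instead of hard-coding the EPSG code.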

However, you need to make sure that your points do fall into the given UTM zone, or the result will be inaccurate. The answer I linked suggests other methods that might work as well and could be applied to the whole set of points.

You could also try converting to EPSG 32663 (WGS 84 / World Equidistant Cylindrical). Note that an equidistant cylindrical projection preserves distances only along meridians and along its standard parallel, so distances in other directions are still distorted.

Another option is geopy, which can compute the geodesic distance with geopy.distance.geodesic.
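Computing a geodesic for every pair is likely too slow for 1 million rows against 7000. One possible sketch, not taken from the linked answers, keeps the cKDTree approach from the question but builds the tree on 3D unit vectors, so that the nearest neighbor is correct on a sphere and the chord length converts to a great-circle distance in meters. The helper names `to_unit_xyz` and `nearest_on_sphere` are hypothetical, and the spherical radius is an assumption, so results can differ slightly from an ellipsoidal geodesic.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius (spherical approximation)


def to_unit_xyz(lat_deg, lon_deg):
    """Convert lat/lon in degrees to points on the unit sphere, shape (n, 3)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))


def nearest_on_sphere(lat_a, lon_a, lat_b, lon_b):
    """For each point in A, return the index of the nearest B and meters to it."""
    tree = cKDTree(to_unit_xyz(lat_b, lon_b))
    chord, idx = tree.query(to_unit_xyz(lat_a, lon_a), k=1)
    # A chord of length c on the unit sphere subtends the angle 2 * asin(c / 2).
    angle = 2 * np.arcsin(np.clip(chord / 2, 0.0, 1.0))
    return idx, angle * EARTH_RADIUS_M


# First row of the question's df against the three rows of df2.
idx, dist_m = nearest_on_sphere(
    [22.525158], [88.330775],
    [29.873870, 20.087574, 31.289770],
    [76.965620, 76.180045, 75.572340],
)
print(idx, dist_m / 1000)
```

If ellipsoidal accuracy matters, the matched pairs returned here could then be re-measured with geopy, which needs only one geodesic call per row of the larger frame.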

Note: technical posts on this site follow the CC BY-SA 4.0 license. If you republish, please credit this site's URL or the original address.