
Getting the distance between two geopandas data frames geometry points

I am working with spatial data for the first time. I have to compare two dataframes that have lat and long details. I have converted both to GeoPandas dataframes.

import pandas as pd
from pandas import DataFrame
import geopandas as gpd
from neighbors import nearest_neighbor


df = pd.DataFrame([[1973,22.525158,88.330775],[1976,72.85136,19.10840],[898,91.78523,26.15012]],columns=['id', 'lat', 'long'])
gdf1 = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long,df.lat))

df2 = pd.DataFrame([['06c979eaa59f',29.873870,76.965620],['19aedbb2e743',20.087574,76.180045],['5060a3931a43',31.289770,75.572340]],columns=['id','lat','lon']) 
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.lon,df2.lat))

My df1 has 1 million rows and df2 has around 7000 rows. I am trying to get the nearest neighbor from df2 for each record in df1.

I have tried two methods. Both run very fast and the results are workable. However, they are not accurate.

Method 1:

Please check this link

Following that page, I used the nearest neighbors method from sklearn.neighbors. This returns the results in meters. However, when I manually check the distance between the lat/long pairs from the two data frames, I always find that the nearest-neighbor method returns about 1/4 of the actual distance.

For example, if the distance returned by the above method is 125 meters, both Google Maps and https://www.geodatasource.com/distance-calculator return a distance of around 500 meters. The ratio between the two fluctuates around 4.

Method 2:

In the second method I followed the code given in gis.stackexchange.com.

https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe

import itertools
from operator import itemgetter

import geopandas as gpd
import numpy as np
import pandas as pd

from scipy.spatial import cKDTree
from shapely.geometry import Point, LineString

df = pd.DataFrame([[1973,22.525158,88.330775],[1976,72.85136,19.10840],[898,91.78523,26.15012]],columns=['id', 'lat', 'long'])
gdf1 = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long,df.lat))

df2 = pd.DataFrame([['06c979eaa59f',29.873870,76.965620],['19aedbb2e743',20.087574,76.180045],['5060a3931a43',31.289770,75.572340]],columns=['id','lat','lon']) 
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.lon,df2.lat))

In this, I replaced the gpd1 and gpd2 with my own data frames.

def ckdnearest(gdfA, gdfB, gdfB_cols=['id']):
    # resetting the index of gdfA and gdfB here.
    gdfA = gdfA.reset_index(drop=True)
    gdfB = gdfB.reset_index(drop=True)
    A = np.concatenate(
        [np.array(geom.coords) for geom in gdfA.geometry.to_list()])
    B = [np.array(geom.coords) for geom in gdfB.geometry.to_list()]
    B_ix = tuple(itertools.chain.from_iterable(
        [itertools.repeat(i, x) for i, x in enumerate(list(map(len, B)))]))
    B = np.concatenate(B)
    ckd_tree = cKDTree(B)
    dist, idx = ckd_tree.query(A, k=1)
    idx = itemgetter(*idx)(B_ix)
    gdf = pd.concat(
        [gdfA, gdfB.loc[idx, gdfB_cols].reset_index(drop=True),
         pd.Series(dist, name='dist')], axis=1)
    return gdf

c = ckdnearest(gdf1, gdf2)

The above runs very fast and returns the result. However, the returned distance values are at least 100 times lower than the ones I get from the website.

multiplier: 107.655914

[image: enter image description here]

In the above Excel screenshot, the first column shows the results returned by Python, while the second column shows the results returned by the same website given above. While these approximate results get me started, I want accurate results. How do I compare the two data frames given above and get the most accurate nearest distance for each row in df1?

When working with spatial data, you should be aware that your point coordinates are latitudes and longitudes on a sphere (more precisely, an ellipsoid), not positions on a plane. If you compute a Euclidean distance directly on lat/lon values, the result is in degrees, not meters. The conversion also depends on the latitude of the points: 1 degree of longitude is about 111 km at the equator and shrinks toward zero at the poles, while 1 degree of latitude stays close to 111 km everywhere.
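To make the degrees-versus-meters point concrete, here is a minimal sketch that estimates how many meters one degree covers at a given latitude. It assumes a perfectly spherical Earth with a mean radius of 6,371,008.8 m; `metres_per_degree` is a hypothetical helper name, not part of geopandas.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius (spherical approximation)

def metres_per_degree(lat_deg):
    """Return (meters per degree of latitude, meters per degree of longitude)."""
    # One degree of latitude is the same arc length everywhere on a sphere.
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # One degree of longitude shrinks with the cosine of the latitude.
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon

for lat in (0, 20, 30, 60):
    lat_m, lon_m = metres_per_degree(lat)
    print(f"lat {lat:>2}: 1 deg lat = {lat_m / 1000:.1f} km, "
          f"1 deg lon = {lon_m / 1000:.1f} km")
```

At the latitudes in the question (roughly 20 to 30 degrees north), one degree is on the order of 100 km, which would be consistent with the factor of about 100 seen with the cKDTree method if the website reports kilometers. That is an inference from the numbers, not something the question confirms.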

You can check this discussion for possible solutions to this problem: https://gis.stackexchange.com/questions/293310/how-to-use-geoseries-distance-to-get-the-right-answer

To give you an example, one possibility is to convert your GeoDataFrame to the UTM projection that covers your region. For example, Belgium intersects UTM zone 31N (EPSG:32631). Plain latitude/longitude on WGS 84 has the code EPSG:4326 (it is a geographic coordinate system, not the Mercator projection). Before you can convert a GeoDataFrame/GeoSeries, it needs a CRS, which you can provide when creating it:

s = gpd.GeoSeries(points, crs=4326)

where points is a list of shapely.geometry.Point

and then to convert to a given UTM:

s_utm = s.to_crs(epsg=32631)

Now the distance you will compute between points in s_utm will be in meters.

However, you need to make sure that your points fall into the given UTM zone, or the result will be inaccurate. The answer I linked suggests other methods that might work as well and could be applied to the whole set of points.
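As a quick way to check which UTM zone a point belongs to, the following sketch derives the WGS 84 / UTM EPSG code from a longitude and latitude. It uses the standard 6-degree zone grid and ignores the special zones around Norway and Svalbard; `utm_epsg_for` is a hypothetical helper name introduced here for illustration.

```python
import math

def utm_epsg_for(lon_deg, lat_deg):
    """Return the EPSG code of the WGS 84 / UTM zone containing a point."""
    if not -80.0 <= lat_deg <= 84.0:
        raise ValueError("UTM is defined only between 80 S and 84 N")
    # Zones are 6 degrees wide, numbered 1..60 starting at longitude -180.
    zone = int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1
    # EPSG 326xx is the northern hemisphere, 327xx the southern hemisphere.
    base = 32600 if lat_deg >= 0 else 32700
    return base + zone

# A point in Belgium (approximate coordinates of Brussels).
print(utm_epsg_for(4.35, 50.85))

# (lat, lon) pairs taken from the question's sample data.
sample = [(22.525158, 88.330775), (29.873870, 76.965620),
          (20.087574, 76.180045), (31.289770, 75.572340)]
print(sorted({utm_epsg_for(lon, lat) for lat, lon in sample}))
```

If the second print shows more than one code, the points span several UTM zones, and a single UTM projection will distort distances for the points far from its zone.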

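You could also try converting to EPSG:32663 (WGS 84 / World Equidistant Cylindrical). Note that an equidistant cylindrical projection keeps distances true only along meridians and along its standard parallel, so distances in other directions are approximate.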

Another option is geopy, which can compute the geodesic distance between two points with geopy.distance.geodesic.
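Calling a per-pair distance function for 1 million by 7000 points is likely to be slow, so here is a sketch of a different technique that keeps the KD-tree speed: convert each lat/lon to a 3-D unit vector, query the tree on those vectors, then convert the straight-line chord back to a great-circle arc. This is not what geopy does; it assumes a spherical Earth (mean radius 6,371,008.8 m), so results can differ from an ellipsoidal geodesic by up to roughly 0.5%. The names `to_unit_sphere` and `nearest_on_sphere` are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius (spherical approximation)

def to_unit_sphere(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3-D unit vectors."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def nearest_on_sphere(lat_a, lon_a, lat_b, lon_b):
    """For each point in A, return (index of nearest B point, distance in meters)."""
    tree = cKDTree(to_unit_sphere(lat_b, lon_b))
    chord, idx = tree.query(to_unit_sphere(lat_a, lon_a), k=1)
    # Chord length grows monotonically with arc length, so the nearest
    # neighbor by chord is also the nearest by great-circle distance.
    arc = 2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0))
    return idx, arc * EARTH_RADIUS_M

# First row of df and all rows of df2 from the question.
lat_b = [29.873870, 20.087574, 31.289770]
lon_b = [76.965620, 76.180045, 75.572340]
idx, dist_m = nearest_on_sphere([22.525158], [88.330775], lat_b, lon_b)
print(idx, dist_m / 1000.0)  # index into df2 and distance in km
```

With the GeoDataFrames from the question, the inputs could be taken from `gdf1.geometry.y` (latitude) and `gdf1.geometry.x` (longitude), and likewise for `gdf2`. If ellipsoidal accuracy matters, one option is to recompute only the matched pairs with geopy afterwards.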
