I have a dataframe df that looks like that:
id id_latlong
1 (46.1988400;5.209562)
2 (46.1988400;5.209562)
3 (46.1988400;5.209562)
4 (46.1988400;5.209562)
5 (46.438805;5.11890299)
6 (46.222993;5.21707600)
7 (46.195183;5.212575)
8 (46.195183;5.212575)
9 (46.195183;5.212575)
10 (48.917459;2.570821)
11 (48.917459;2.570821)
Every row is a location and the data in the column "id_latlong" are coordinates.
I want to select the id of every location that is at less than 800 meters from a defined location:
defined_location_latlong = "(46.1988400;5.209562)"
I have a function that calcule the distance, in meters, between two coordinates:
def distance_btw_coordinates (id_latlong1, id_latlong2) :
try :
R = 6372800 # Earth radius in meters
lat1 = float(id_latlong1.partition('(')[2].partition(';')[0])
lon1 = float(id_latlong1.partition(';')[2].partition(')')[0])
lat2 = float(id_latlong2.partition('(')[2].partition(';')[0])
lon2 = float(id_latlong2.partition(';')[2].partition(')')[0])
phi1, phi2 = math.radians(lat1), math.radians(lat2)
dphi = math.radians(lat2 - lat1)
dlambda = math.radians(lon2 - lon1)
a = math.sin(dphi/2)**2 + \
math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
distance = 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))
except :
distance = 1000000000
return distance
In order to select every row that is at less than 800 meters from the defined location, I tried:
df.loc[distance_btw_cohordonates(df['id_latlong'], defined_location_latlong ) < 800]
But it doesn't work:
KeyError: False
It doesn't work because the function takes all the data in the column "id_latlong" at once...
Do you know how I could do this without having to iterate?
Thank you !
EDIT: I have 500k different defined locations, I would prefer not having to stock the distance between every row in df and every defined location... Is it possible to select every location that is at less than 800 meters without having to stock the distances?
I think you need processing function for each value of column separately by Series.apply
:
s = df['id_latlong'].apply(lambda x: distance_btw_coordinates(x, defined_location_latlong))
print (s)
0 1000000000
1 1000000000
2 1000000000
3 1000000000
4 1000000000
5 1000000000
6 1000000000
7 1000000000
8 1000000000
9 1000000000
10 1000000000
Name: id_latlong, dtype: int64
df.loc[s < 800]
EDIT:
Is it possible to select every location that is at less than 800 meters without having to stock the distances?
One idea is use vectorizes function haversine_np , but is necessary change your code for parsing strings to columns and also to numeric:
def haversine_np(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
All args must be of equal length.
"""
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km
df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
print (df)
id id_latlong lat long
0 1 (46.1988400;5.209562) 46.198840 5.209562
1 2 (46.1988400;5.209562) 46.198840 5.209562
2 3 (46.1988400;5.209562) 46.198840 5.209562
3 4 (46.1988400;5.209562) 46.198840 5.209562
4 5 (46.438805;5.11890299) 46.438805 5.118903
5 6 (46.222993;5.21707600) 46.222993 5.217076
6 7 (46.195183;5.212575) 46.195183 5.212575
7 8 (46.195183;5.212575) 46.195183 5.212575
8 9 (46.195183;5.212575) 46.195183 5.212575
9 10 (48.917459;2.570821) 48.917459 2.570821
10 11 (48.917459;2.570821) 48.917459 2.570821
lat, long = tuple(map(float, defined_location_latlong.strip('()').split(';')))
print (lat, long)
46.19884 5.209562
s = haversine_np(df['long'], df['lat'], lat, long)
print (s)
0 6016.063040
1 6016.063040
2 6016.063040
3 6016.063040
4 6037.462224
5 6017.186477
6 6015.635700
7 6015.635700
8 6015.635700
9 6353.080382
10 6353.080382
dtype: float64
#km output
df.loc[s < 0.8]
EDIT1:
For improve performance of spliting is possible use:
#550000 rows for test
df = pd.concat([df] * 50000, ignore_index=True)
df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
print (df)
In [38]: %timeit df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
2.49 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [39]: %timeit df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
937 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.set_option('display.float_format', lambda x: '%.6f' % x)
from scipy.spatial import KDTree
import pandas as pd
df = pd.read_clipboard()
print(df)
id id_latlong
0 1 (46.1988400;5.209562)
1 2 (46.1988400;5.209562)
2 3 (46.1988400;5.209562)
3 4 (46.1988400;5.209562)
4 5 (46.438805;5.11890299)
5 6 (46.222993;5.21707600)
6 7 (46.195183;5.212575)
7 8 (46.195183;5.212575)
8 9 (46.195183;5.212575)
9 10 (48.917459;2.570821)
10 11 (48.917459;2.570821)
Condition the df.
df = df['id_latlong'].str.split(";", expand=True)
df['lat'] = df[0].str.replace('(', '')
df['lon'] = df[1].str.replace(')', '')
df['lat'] = pd.to_numeric(df['lat'])
df['lon'] = pd.to_numeric(df['lon'])
print(df.head(3))
0 1 lat lon
0 (46.1988400 5.209562) 46.19884 5.209562
1 (46.1988400 5.209562) 46.19884 5.209562
2 (46.1988400 5.209562) 46.19884 5.209562
Convert to UTM 31 N so we can have meters distance instead of lat/long.
dl_df = pd.DataFrame({'lat':[46.1988400], 'lon':5.209562})
dl_gdf = gpd.GeoDataFrame(dl_df, geometry=gpd.points_from_xy(dl_df.lon, dl_df.lat))
dl_gdf.crs = 4326
dl_gdf = dl_gdf.to_crs(32631)
dl_gdf['E'] = dl_gdf['geometry'].x
dl_gdf['N'] = dl_gdf['geometry'].y
print(dl_gdf)
lat lon geometry E N
0 46.198840 5.209562 POINT (670475.888 5118513.417) 670475.888071 5118513.416524
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))
gdf.crs = 4326
gdf = gdf.to_crs(32631)
gdf['E'] = gdf['geometry'].x
gdf['N'] = gdf['geometry'].y
print(gdf)
0 1 lat lon geometry E N
0 (46.1988400 5.209562) 46.198840 5.209562 POINT (670475.888 5118513.417) 670475.888071 5118513.416524
1 (46.1988400 5.209562) 46.198840 5.209562 POINT (670475.888 5118513.417) 670475.888071 5118513.416524
2 (46.1988400 5.209562) 46.198840 5.209562 POINT (670475.888 5118513.417) 670475.888071 5118513.416524
3 (46.1988400 5.209562) 46.198840 5.209562 POINT (670475.888 5118513.417) 670475.888071 5118513.416524
4 (46.438805 5.11890299) 46.438805 5.118903 POINT (662767.928 5144985.070) 662767.928322 5144985.069816
5 (46.222993 5.21707600) 46.222993 5.217076 POINT (670980.609 5121213.169) 670980.608959 5121213.168557
6 (46.195183 5.212575) 46.195183 5.212575 POINT (670719.678 5118113.575) 670719.677504 5118113.574785
7 (46.195183 5.212575) 46.195183 5.212575 POINT (670719.678 5118113.575) 670719.677504 5118113.574785
8 (46.195183 5.212575) 46.195183 5.212575 POINT (670719.678 5118113.575) 670719.677504 5118113.574785
9 (48.917459 2.570821) 48.917459 2.570821 POINT (468556.965 5418368.922) 468556.964829 5418368.922484
10 (48.917459 2.570821) 48.917459 2.570821 POINT (468556.965 5418368.922) 468556.964829 5418368.922484
Find the distances in meters with KDTree. If there was more than one row in dl_gdf the indices in new_df would be the closest point.
join_cols = ['E', 'N']
tree = KDTree(dl_gdf[join_cols])
distance, indices = tree.query(gdf[join_cols])
new_df = pd.DataFrame({'distance':distance, 'indices': indices})
print(new_df)
distance indices
0 0.000000 0
1 0.000000 0
2 0.000000 0
3 0.000000 0
4 27571.018688 0
5 2746.525845 0
6 468.301937 0
7 468.301937 0
8 468.301937 0
9 361503.217161 0
10 361503.217161 0
Get the rows with points < 800m.
less_than_800m_df = new_df.loc[new_df['distance'] < 800]
print(less_than_800m_df)
distance indices
0 0.000000 0
1 0.000000 0
2 0.000000 0
3 0.000000 0
6 468.301937 0
7 468.301937 0
8 468.301937 0
Here is a validation image from QGIS. The precision isn't great with the manual measurement, but the results in the new_df look correct.
Here is a closeup for new_df idxs 6,7, and 8.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.