How do I calculate the distance between a row and the row immediately before it for each different value of a third column?
I have a dataframe with device, date, and latitude/longitude columns, like this:
e5c0e3a5 2019-09-23 00:25:48 -44.132 -30.369
e5c0e3a5 2019-09-23 00:30:48 -43.437 -30.633
...
a5c0d8b8 2019-09-23 03:20:48 -30.132 -40.369
a5c0d8b8 2019-09-23 03:50:12 -30.437 -41.633
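The records are sorted by user and time. I need to measure the distance each user moved between time t and time t+1 (or, to avoid the NaN in the first row, between t and t-1, starting from row 2).
I am using the geodesic function (from geopy.distance import geodesic) to calculate the distance, and would like to end up with data of this form: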
e5c0e3a5 2019-09-23 00:25:48 20
a5c0d8b8 2019-09-23 03:50:12 50
...
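where I obtained the distance of 20, in km, by taking row 2 and measuring its distance from row 1.
More generally, how do I perform an operation (geodesic) between a row and the row before it for each different device?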
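geodesic(df[['long', 'lat']].to_numpy(), df[['s_long', 's_lat']].to_numpy()) won't work, because geodesic does not work with arrays. Instead, shift the coordinates into new columns and apply geodesic row by row:
import pandas as pd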
from geopy.distance import geodesic
# set up data and dataframe; extra data has been added
data = {'code': ['e5c0e3a5', 'e5c0e3a5', 'e5c0e3a5', 'a5c0d8b8', 'a5c0d8b8', 'a5c0d8b8'],
'datetime': ['2019-09-23 00:25:48', '2019-09-23 00:30:48', '2019-09-23 00:35:48', '2019-09-23 03:20:48', '2019-09-23 03:50:12', '2019-09-23 04:00:12'],
'long': [-44.132, -43.437, -40.654, -30.132, -30.437, -30.000],
'lat': [-30.369, -30.633, -29.00, -40.369, -41.633, -43.345]}
df = pd.DataFrame(data)
# sort the dataframe by code and datetime
df = df.sort_values(['code', 'datetime']).reset_index(drop=True)
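# add shifted columns holding the coordinates of the next row
df[['s_long', 's_lat']] = df[['long', 'lat']].shift(-1)
# drop na; with shift(-1) the last row has no next row and will be nan, which won't work with geodesic
df.dropna(inplace=True)
# apply geodesic to calculate the distance between each row and the next one
# note: geodesic expects (latitude, longitude) tuples; check that the column passed first holds latitude in your data
df['distance_miles'] = df[['long', 'lat', 's_long', 's_lat']].apply(lambda x: geodesic((x['long'], x['lat']), (x['s_long'], x['s_lat'])).miles, axis=1)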
# display(df)
       code             datetime    long     lat  s_long   s_lat  distance_miles
0  a5c0d8b8  2019-09-23 03:20:48 -30.132 -40.369 -30.437 -41.633        78.43026
1  a5c0d8b8  2019-09-23 03:50:12 -30.437 -41.633 -30.000 -43.345       106.74601
2  a5c0d8b8  2019-09-23 04:00:12 -30.000 -43.345 -44.132 -30.369      1206.65789
3  e5c0e3a5  2019-09-23 00:25:48 -44.132 -30.369 -43.437 -30.633        49.76606
4  e5c0e3a5  2019-09-23 00:30:48 -43.437 -30.633 -40.654 -29.000       209.63396
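Because geodesic handles one pair of points at a time, an array-friendly alternative can be useful on large frames. The following is a minimal sketch that swaps in the haversine (great-circle) formula with NumPy instead of geopy's ellipsoidal geodesic. The function name haversine_km and the radius constant are illustrative assumptions, and the spherical model typically differs from the ellipsoidal result by a fraction of a percent. The sample values are the e5c0e3a5 rows, read as (latitude, longitude) in the order the question lists them.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of the spherical model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# coordinates of the three e5c0e3a5 records, assumed to be (latitude, longitude)
lat = np.array([-44.132, -43.437, -40.654])
lon = np.array([-30.369, -30.633, -29.000])

# distance from each record to the next one, computed for all pairs in one call
step_km = haversine_km(lat[:-1], lon[:-1], lat[1:], lon[1:])
```

The same function could be called on df columns and their shifted counterparts, since pandas Series are accepted wherever NumPy arrays are.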
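In the distance_miles table above, row 2 pairs the last a5c0d8b8 record with the first e5c0e3a5 record, so that distance spans two different devices.
To calculate the distance only within each code group, use .groupby on 'code' and .GroupBy.apply with the function get_distance.
def get_distance(d: pd.DataFrame) -> pd.DataFrame:
    v = d.copy()  # otherwise, working on d will do an inplace update to df, which will cause unexpected/undesired results.
    # code will be in the index, so a code column is not needed; errors='ignore' covers the case where pandas has already excluded the grouping column
    v.drop(columns=['code'], inplace=True, errors='ignore')
    v[['s_long', 's_lat']] = v[['long', 'lat']].shift(-1)  # coordinates of the next row within the group
    v.dropna(inplace=True)  # the last row of each group has no next row
    v['dist_miles'] = v[['long', 'lat', 's_long', 's_lat']].apply(lambda x: geodesic((x['long'], x['lat']), (x['s_long'], x['s_lat'])).miles, axis=1)
    return v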
# set up data and dataframe; extra data has been added
data = {'code': ['e5c0e3a5', 'e5c0e3a5', 'e5c0e3a5', 'a5c0d8b8', 'a5c0d8b8', 'a5c0d8b8'],
'datetime': ['2019-09-23 00:25:48', '2019-09-23 00:30:48', '2019-09-23 00:35:48', '2019-09-23 03:20:48', '2019-09-23 03:50:12', '2019-09-23 04:00:12'],
'long': [-44.132, -43.437, -40.654, -30.132, -30.437, -30.000],
'lat': [-30.369, -30.633, -29.00, -40.369, -41.633, -43.345]}
df = pd.DataFrame(data)
# sort the dataframe by code and datetime
df = df.sort_values(['code', 'datetime']).reset_index(drop=True)
# apply the function to the groups
test = df.groupby('code').apply(get_distance)
# display(test)
                       datetime    long     lat  s_long   s_lat  dist_miles
code
a5c0d8b8 0  2019-09-23 03:20:48 -30.132 -40.369 -30.437 -41.633    78.43026
         1  2019-09-23 03:50:12 -30.437 -41.633 -30.000 -43.345   106.74601
e5c0e3a5 3  2019-09-23 00:25:48 -44.132 -30.369 -43.437 -30.633    49.76606
         4  2019-09-23 00:30:48 -43.437 -30.633 -40.654 -29.000   209.63396
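For the more general question of applying an operation between each row and the row before it within every device, one possible sketch uses groupby(...).shift(1), which keeps the original row index and leaves NaN for the first row of each group. The helper name apply_to_previous_row and its parameters are hypothetical, and math.dist is only a planar stand-in (in degrees, not a real distance) so that the mechanics can be shown without geopy.

```python
import math

import pandas as pd


def apply_to_previous_row(df, func, group_col="code", cols=("long", "lat")):
    """Return func(previous_values, current_values) per row; NaN for each group's first row."""
    cols = list(cols)
    # shift(1) inside each group lines every row up with the row before it;
    # the first row of each group has no predecessor and becomes NaN
    prev = df.groupby(group_col)[cols].shift(1)
    results = []
    for prev_vals, cur_vals in zip(prev.itertuples(index=False, name=None),
                                   df[cols].itertuples(index=False, name=None)):
        if any(pd.isna(v) for v in prev_vals):
            results.append(float("nan"))
        else:
            results.append(func(prev_vals, cur_vals))
    return pd.Series(results, index=df.index)


# first two records of each device from the data above
df = pd.DataFrame({
    "code": ["e5c0e3a5", "e5c0e3a5", "a5c0d8b8", "a5c0d8b8"],
    "datetime": ["2019-09-23 00:25:48", "2019-09-23 00:30:48",
                 "2019-09-23 03:20:48", "2019-09-23 03:50:12"],
    "long": [-44.132, -43.437, -30.132, -30.437],
    "lat": [-30.369, -30.633, -40.369, -41.633],
})
df = df.sort_values(["code", "datetime"]).reset_index(drop=True)

# math.dist is a stand-in operation; any function of two coordinate tuples can be passed
df["step"] = apply_to_previous_row(df, math.dist)
```

With geopy available, passing lambda a, b: geodesic(a, b).km in place of math.dist should give kilometres, provided the tuples arrive in the (latitude, longitude) order that geodesic expects. Unlike the shift(-1) and dropna approach, this keeps every row and marks the first record of each device with NaN.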