How to add a column to my data frame that calculates the distance between each lat/long point and the previous point with a matching ID
I have a data frame of individual animals, each with a unique ID, the lat/long where it was found, and the date it was found. The database has frequent returns of the same individual, and I have over 2000 individuals. I want to add a column to my data frame that holds the Euclidean distance between the current location and the previous location of the same individual. I also want a second column that tells me which calculation number I'm on for each individual. The data frame is already sorted by date. I'm trying to solve this in R.
Event | ID | Lat | Long
---|---|---|---
1 | 1 | 31.89 | -80.98
2 | 2 | 31.54 | -80.12
3 | 1 | 31.45 | -81.92
4 | 1 | 31.64 | -81.82
5 | 2 | 31.23 | -80.98
I want to add the columns so that it looks like this:
Event | ID | Lat | Long | Dist. | Calculation #
---|---|---|---|---|---
1 | 1 | 31.89 | -80.98 | - | 0
2 | 2 | 31.54 | -80.12 | - | 0
3 | 1 | 31.45 | -81.92 | Distance between event 1 & 3 | 1
4 | 1 | 31.64 | -81.82 | Distance between event 3 & 4 | 2
5 | 2 | 31.23 | -80.98 | Distance between event 2 & 5 | 1
Is there a faster way to do this without a for loop? I'm stuck on where to start. I know I can use a distance function from a geospatial package once I have the unique IDs sorted, but I'm having trouble iterating through my data.
Here is one option which leans on the sf package and dplyr. The function sf::st_distance calculates distances between pairs of points, and dplyr::lag can be used to look "one row behind" within each group. You will want to confirm your coordinate system, which I guessed here is WGS84 / EPSG:4326. With a geographic CRS like this one, st_distance returns geodesic distances in metres rather than planar Euclidean distances in degrees.
library(dplyr)
library(sf)
dat <- read.table(text = " Event ID Lat Long
1 1 31.89 -80.98
2 2 31.54 -80.12
3 1 31.45 -81.92
4 1 31.64 -81.82
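5 2 31.23 -80.98", header = TRUE)
dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326)  # assumes WGS84 long/lat
dat_sf %>%
  arrange(ID) %>%    # stable sort: rows within an ID keep their date order
  group_by(ID) %>%   # lag() and row_number() now work per individual
  mutate(distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
         calculation = row_number() - 1)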
#> Simple feature collection with 5 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -81.92 ymin: 31.23 xmax: -80.12 ymax: 31.89
#> Geodetic CRS: WGS 84
#> # A tibble: 5 x 5
#> # Groups: ID [2]
#> Event ID geometry distance calculation
#> * <int> <int> <POINT [°]> <dbl> <dbl>
#> 1 1 1 (-80.98 31.89) NA 0
#> 2 3 1 (-81.92 31.45) 101524. 1
#> 3 4 1 (-81.82 31.64) 23155. 2
#> 4 2 2 (-80.12 31.54) NA 0
#> 5 5 2 (-80.98 31.23) 88615. 1
Created on 2022-11-14 by the reprex package (v2.0.0)
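If installing sf is not an option, the same "look one row behind within each ID" idea can be sketched in base R. This is only an illustration, not part of the answer above: `haversine_m` and `lag1` are hypothetical helpers written for this example, and the spherical Earth radius of 6371 km is an assumption, so the distances should come out close to, but not identical to, the sf result. Unlike the sf version, the rows stay in their original event order.

```r
# Hypothetical helper: great-circle (haversine) distance in metres.
# The radius is an assumed spherical approximation of the Earth.
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

dat <- data.frame(
  Event = 1:5,
  ID    = c(1, 2, 1, 1, 2),
  Lat   = c(31.89, 31.54, 31.45, 31.64, 31.23),
  Long  = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

# Hypothetical helper: shift a vector down by one, padding the front with NA.
lag1 <- function(x) c(NA, head(x, -1))

# ave() applies the function within each ID and returns values in row order.
prev_lat  <- ave(dat$Lat,  dat$ID, FUN = lag1)
prev_long <- ave(dat$Long, dat$ID, FUN = lag1)

# NA for an individual's first sighting, metres from the previous one otherwise.
dat$Dist <- haversine_m(prev_lat, prev_long, dat$Lat, dat$Long)

# 0 for an individual's first sighting, then 1, 2, ... for each return.
dat$Calculation <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along) - 1

dat
```

Because `ave()` writes its results back in the original row positions, no sorting by ID is needed as long as the data frame is already in date order.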
Try this:
library(geosphere)
library(dplyr)     # arrange(), mutate()
library(magrittr)  # %>% and %<>%
df <- data.frame(
  event = seq(5),
  id = c(1, 2, 1, 1, 2),
  lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)
df <- keys %>% lapply(
  function(key){
    tmp <- df[df$id == key, ]
    # Previous position within this individual. The first record points at
    # itself, so its distance is 0. Also safe when there is only one row.
    prev <- c(1, seq_len(nrow(tmp) - 1))
    tmp$last_lat <- tmp$lat[prev]
    tmp$last_long <- tmp$long[prev]
    tmp
  }
) %>% do.call(rbind, .)
df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))  # metres
Since you said you need speed, below is the same code as above but run in parallel:
library(tictoc)
library(parallel)
library(geosphere)
library(dplyr)     # arrange(), mutate()
library(magrittr)  # %>% and %<>%
tic()
clust <- makeCluster(max(1, detectCores() - 1))  # leave one core free
df <- data.frame(
  event = seq(5),
  id = c(1, 2, 1, 1, 2),
  lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)
clusterExport(clust, "df")  # copy the sorted data frame to every worker
df <- keys %>% parLapply(
  clust, .,
  function(key){
    tmp <- df[df$id == key, ]
    # Previous position within this individual. The first record points at
    # itself, so its distance is 0. Also safe when there is only one row.
    prev <- c(1, seq_len(nrow(tmp) - 1))
    tmp$last_lat <- tmp$lat[prev]
    tmp$last_long <- tmp$long[prev]
    tmp
  }
) %>% do.call(rbind, .)
stopCluster(clust)  # release the worker processes
df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))  # metres
toc()
Above, tictoc just records the execution time. I just created a cluster with the number of your CPU cores minus 1, and changed the lapply part to parLapply. The second version will be slower than the first if you have a small dataset (due to the overhead of setting up the parallel computation). But if you have a large dataset, the second version can be much faster, because the per-individual work is spread across the worker processes.
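Before reaching for a cluster, it may be worth trying a version with no per-individual loop at all. The sketch below is an alternative to the lapply approach, not the code from this answer: it sorts once by ID, shifts the coordinate columns by one row, and blanks out the shifted values wherever the ID changes. `haversine_m` is a hypothetical base R helper with an assumed spherical Earth radius, standing in for geosphere::distHaversine so that the example has no package dependencies.

```r
# Hypothetical helper: great-circle (haversine) distance in metres.
# The radius is an assumed spherical approximation of the Earth.
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

df <- data.frame(
  event = seq(5),
  id = c(1, 2, 1, 1, 2),
  lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

# Sort once so that each individual's records are contiguous and in event order.
ord <- order(df$id, df$event)
s <- df[ord, ]
n <- nrow(s)

# TRUE where the row above belongs to the same individual.
same_id <- c(FALSE, s$id[-1] == s$id[-n])

# Shift the coordinates down one row, then drop values that cross an ID boundary.
prev_lat  <- c(NA, s$lat[-n])
prev_long <- c(NA, s$long[-n])
prev_lat[!same_id]  <- NA
prev_long[!same_id] <- NA

# One vectorised distance call over all rows; NA marks a first sighting.
s$dist <- haversine_m(prev_lat, prev_long, s$lat, s$long)

# Write the distances back in the original row order.
df$dist <- NA_real_
df$dist[ord] <- s$dist
df
```

Every step here is a single vectorised operation over the whole data frame, so the cost should not grow with the number of individuals the way a per-ID subset does. Whether it beats the parallel version on a given dataset would need to be timed.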